### Browsing by Author "Lubbe, Sugnet"

- Biplots based on principal surfaces (2019). Ganey, Raeesa; Er, Sebnem; Lubbe, Sugnet. Open Access. Principal surfaces are smooth two-dimensional surfaces that pass through the middle of a p-dimensional data set. They minimise the distance from the data points and provide a nonlinear summary of the data. The surfaces are nonparametric and their shape is suggested by the data. A surface is formed using an iterative procedure which starts with a linear summary, typically a principal component plane. Each successive iteration is a local average of the p-dimensional points, where an average is based on a projection of a point onto the nonlinear surface of the previous iteration. Biplots are considered extensions of the ordinary scatterplot that provide for more than three variables. When the differences between data points are measured using a Euclidean embeddable dissimilarity function, observations and the associated variables can be displayed on a nonlinear biplot. A nonlinear biplot is predictive if information on variables is added in such a way that it allows the values of the variables to be estimated for points in the biplot. Prediction trajectories, which tend to be nonlinear, are created on the biplot to allow information about variables to be estimated. The goal is to extend nonlinear biplot methodology to principal surfaces. The ultimate emphasis is on high-dimensional data, where the nonlinear biplot based on a principal surface allows for visualisation of samples, variable trajectories and predictive sets of contour lines. The proposed biplot provides more accurate predictions, with the additional feature of visualising the extent of nonlinearity that exists in the data.
- The construction of a partial least squares biplot (2014). Oyedele, Opeoluwa Funmilayo; Lubbe, Sugnet. Open Access. In multivariate analysis, data matrices are often very large, which sometimes makes it difficult to describe their structure and to visually inspect the relationship between their rows (samples) and columns (variables). For this reason, biplots, the joint graphical display of the rows and columns of a data matrix, can be useful tools for analysis. Since they were first introduced, biplots have been employed in a number of multivariate methods, such as Correspondence Analysis (CA), Principal Component Analysis (PCA), Canonical Variate Analysis (CVA) and Discriminant Analysis (DA), as a form of graphical display of data. Another possible application is in Partial Least Squares (PLS). First introduced as a regression method, PLS is more flexible than multivariate regression and better suited than Principal Component Regression (PCR) for the prediction of a set of response variables from a large set of predictor variables. Employing the biplot in PLS gave rise to the PLS biplot, a new addition to the biplot family. In the current study, this biplot was successfully applied to sensory data to investigate the relationships between the sensory panel characteristics and the chemical quality measurements of sixteen olive oils. It was also applied to a large set of mineral sorting production data to investigate the relationships between the output variables and the process factors used to produce a final product. Furthermore, the PLS biplot was applied to a Binomial-distributed data set concerning the diabetes testing of Indian women and to a Poisson-distributed data set showing the diversity of arboreal marsupials (possum) in the montane ash forest.
After these applications, it is proposed that the PLS biplot is a useful graphical tool for displaying results from the (univariate) Partial Least Squares-Generalized Linear Model (PLS-GLM) analysis of a data set. With Partial Least Squares Regression (PLSR) being a valuable method for modelling high-dimensional data, especially in chemometrics, the PLS biplot was also successfully applied to a cereal evaluation data set containing one hundred and forty-five infrared spectra and six chemical properties, and to a gene expression data set with two thousand genes.
- Functional linear regression on Namibian and South African data (2016). Mzimela, Nosipho; Lubbe, Sugnet. Open Access. Indigenous to Southern Africa, Aloe dichotoma, most commonly known as the quiver tree, is a species of aloe found mostly in the southern parts of Namibia and the Northern Cape Province in South Africa. Researchers noticed that quiver trees assumed very different shapes depending on their geographical location. This project aims to model the observed differences in structural form of the trees between geographically separate populations with functional regression analysis using climate variables at each location. A number of statistical challenges present themselves, such as the multivariate nature of the data. Functional data analysis was used in this project to display the data so as to highlight various characteristics while allowing us to study important sources of pattern and variation among the data. Functional data analysis can best be summarised as approximating discrete data with a function by assuming the existence of a function giving rise to the observed data. The underlying function is assumed to be smooth, so that adjacent data values are necessarily linked together and unlikely to be too different from each other. There are a number of smoothing methods used to fit a function to the discrete data. In this project we use Roughness Penalty Smoothing methods, which are based on optimising a fitting criterion that defines what a smooth of the data is trying to achieve. The meaning of smooth is explicitly expressed at the level of the criterion being optimised, rather than implicitly in terms of the number of basis functions used. Once the continuous functions for the climate variables have been fitted, these are used as predictors in a functional regression model with the structural variables as responses.
This allows for the estimation of regression coefficients to describe the effect of the climate variables on each structural variable. The functional models suggest that maximum temperature has an effect on the structural form of Aloe dichotoma. Further, the structural form of Aloe dichotoma does differ between geographically separate locations. Trees found in the warmer northern regions are more likely to be taller. The results did not necessarily prove the hypothesis that trees in the north have fewer branches than those in the south, but the northern trees are more likely to have more dichotomous branches, which may translate to more branches.
- Hedge fund of funds investment process: a South African perspective (2014). Hossain, Mahzabeen Natasha; Lubbe, Sugnet. Open Access. The objective of this dissertation is to develop and test an investment process for hedge fund of funds (HFoFs) in South Africa. The dissertation proposes a three-tiered process, adapted from the works of Lo (2008). Step one of the process involves the categorisation of hedge funds into broadly defined groups based on predefined factors. Two classification methodologies are examined herein to determine optimal category definitions. These are 1) an adaptation of the classification developed by Schneeweis and Spurgin (2000), based on the correlation of hedge funds to an appropriate benchmark and the returns offered by these hedge funds, and 2) classification by cluster analysis. Once a finite set of classifications is defined, step two of the process uses a minimum variance optimisation, based on forward-looking parameter estimates of return and covariance, to compute the optimal capital allocation to these categories. The final stage of the process employs a mixture of quantitative and qualitative analysis to allocate capital within categories to individual hedge funds.
- An investigation into Functional Linear Regression Modeling (2015). Essomba, Rene Franck; Lubbe, Sugnet. Open Access. Functional data analysis, commonly known as "FDA", refers to the analysis of information on curves or functions. Key aspects of FDA include the choice of smoothing techniques, data reduction, model evaluation, functional linear modelling and forecasting methods. FDA is applicable in numerous fields such as Bioscience, Geology, Psychology, Sports Science, Econometrics and Meteorology. The main objective of this dissertation is to focus more specifically on Functional Linear Regression Modelling (FLRM), which is an extension of Multivariate Linear Regression Modelling. The problem of constructing a Functional Linear Regression model with functional predictors and a functional response variable is considered in great detail. Discretely observed data for each variable involved in the modelling are expressed as smooth functions using Fourier, B-Spline and Gaussian bases. The Functional Linear Regression Model is estimated by the Least Squares method, the Maximum Likelihood method and, more thoroughly, the Penalized Maximum Likelihood method. A central issue when fitting Functional Regression models is the choice of a suitable model criterion as well as the number of basis functions and an appropriate smoothing parameter. Four different types of model criteria are reviewed: the Generalized Cross-Validation, the Generalized Information Criterion, the modified Akaike Information Criterion and the Generalized Bayesian Information Criterion. Each of these methods is applied to a dataset and contrasted based on their respective results.
- An investigation of Multidimensional Scaling with an emphasis on the development of an R based Graphical User Interface for performing Multidimensional Scaling procedures (2012). Timm, Andrew; Lubbe, Sugnet. Open Access. This dissertation is centered around the development of a graphical user interface, using the R statistical programming language, for performing Multidimensional Scaling. This program is called the MDS-GUI. Multidimensional Scaling (MDS) is a group of multivariate analysis techniques used for dimension reduction. In general, these methods can be viewed as solving the problem of constructing a map when given a set of interpoint distances. The graphical configuration is produced, usually in two or three dimensions, in such a way that objects of the data are represented by points, where the Euclidean distances between them optimally represent the given set of observed distances. The MDS-GUI was developed using a combination of R and the scripting language tcltk. The primary objective of its design was to provide a comprehensive range of MDS methods and analytical tools that are accessed in a point-and-click manner. The target user group of the software is therefore wide, as no coding and only a little expertise on MDS is required for its use. The capabilities of the MDS-GUI are demonstrated with the use of three data sets. The first is the well-known and widely used Morse-Code data; the second is a synthetic microarray-based data set; and the third concerns the nutritional contents of a group of cereals from the Kellogg's company. The program will be the first complete MDS-based GUI for the R environment, and will also be the package that provides access to the widest range of MDS methods in R.
- Principal points, principal curves and principal surfaces (2015). Ganey, Raeesa; Lubbe, Sugnet. Open Access. Approximating a distribution is a prominent problem in statistics. This dissertation explores the theory of principal points and principal curves as methods for approximating a distribution. Principal points of a distribution were first introduced by Flury (1990), who tackled the problem of optimal grouping in multivariate data. In essence, principal points are the theoretical counterparts of cluster means obtained by the k-means algorithm. Principal curves, defined by Hastie (1984), are smooth one-dimensional curves that pass through the middle of a p-dimensional data set, providing a nonlinear summary of the data. In this dissertation, details on the usefulness of principal points and principal curves are reviewed. The application of principal points and principal curves is then extended beyond their original purpose to well-known computational methods like Support Vector Machines in machine learning.
- Respiratory microbes present in the nasopharynx of children hospitalised with suspected pulmonary tuberculosis in Cape Town, South Africa (BioMed Central, 2016-10-24). Dube, Felix S; Kaba, Mamadou; Robberts, F J Lourens; Tow, Lemese A; Lubbe, Sugnet; Zar, Heather J; Nicol, Mark P. Open Access. Background: Lower respiratory tract infection in children is increasingly thought to be polymicrobial in origin. Children with symptoms suggestive of pulmonary tuberculosis (PTB) may have tuberculosis, other respiratory tract infections or co-infection with Mycobacterium tuberculosis and other pathogens. We aimed to identify the presence of potential respiratory pathogens in nasopharyngeal (NP) samples from children with suspected PTB. Method: NP samples collected from consecutive children presenting with suspected PTB at Red Cross Children’s Hospital (Cape Town, South Africa) were tested by multiplex real-time RT-PCR. Mycobacterial liquid culture and Xpert MTB/RIF were performed on 2 induced sputa obtained from each participant. Children were categorised as definite-TB (culture or qPCR [Xpert MTB/RIF] confirmed), unlikely-TB (improvement of symptoms without TB treatment on follow-up) and unconfirmed-TB (all other children). Results: Amongst 214 children with a median age of 36 months (interquartile range [IQR], 19–66 months), 34 (16 %) had definite-TB, 86 (40 %) had unconfirmed-TB and 94 (44 %) were classified as unlikely-TB. Moraxella catarrhalis (64 %), Streptococcus pneumoniae (42 %), Haemophilus influenzae spp (29 %) and Staphylococcus aureus (22 %) were the most common bacteria detected in NP samples. Other bacteria detected included Mycoplasma pneumoniae (9 %), Bordetella pertussis (7 %) and Chlamydophila pneumoniae (4 %). The most common viruses detected included metapneumovirus (19 %), rhinovirus (15 %), influenza virus C (9 %), adenovirus (7 %), cytomegalovirus (7 %) and coronavirus OC43 (5.6 %).
Both bacteria and viruses were detected in 73, 55 and 56 % of the definite, unconfirmed and unlikely-TB groups, respectively. There were no significant differences in the distribution of respiratory microbes between children with and without TB. Using quadratic discriminant analysis, human metapneumovirus, C. pneumoniae, coronavirus OC43, influenza virus C, rhinovirus and cytomegalovirus best discriminated children with definite-TB from the other groups of children. Conclusions: A broad range of potential respiratory pathogens was detected in children with suspected TB. There was no clear association between TB categorisation and detection of a specific pathogen. Further work is needed to explore potential pathogen interactions and their role in the pathogenesis of PTB.
- Unravelling black box machine learning methods using biplots (2019). Rowan, Adriaan; Little, Francesca; Lubbe, Sugnet. Open Access. Following the development of new mathematical techniques, the improvement of computer processing power and the increased availability of possible explanatory variables, the financial services industry is moving toward the use of new machine learning methods, such as neural networks, and away from older methods such as generalised linear models. However, their use is currently limited because they are seen as “black box” models, which give predictions without justifications and which are therefore not understood and cannot be trusted. The goal of this dissertation is to expand on the theory and use of biplots to visualise the impact of the various input factors on the output of the machine learning black box. Biplots are used because they give an optimal two-dimensional representation of the data set on which the machine learning model is based. The biplot allows every point on the biplot plane to be converted back to the original p dimensions, in the same format as is used by the machine learning model. This allows the output of the model to be represented by colour coding each point on the biplot plane according to the output of an independently calibrated machine learning model. The interaction of the changing prediction probabilities (represented by the coloured output) with the data points, the variable axes and the category level points represented on the biplot allows the machine learning model to be globally and locally interpreted. By visualising the models and their predictions, this dissertation aims to remove the stigma of calling non-linear models “black box” models and encourage their wider application in the financial services industry.
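
The last abstract above describes mapping every point on the biplot plane back to the original variables and colouring it by a model's output. A minimal sketch of that idea follows, assuming a plain PCA biplot plane (the dissertation's own biplot construction may differ) and using synthetic data; `fit_pca_plane`, `to_plane`, `from_plane` and `black_box` are hypothetical names introduced here for illustration, and `black_box` is only a stand-in for a fitted model.

```python
import numpy as np


def fit_pca_plane(X, n_dims=2):
    """Return the column means of X and its first n_dims principal axes."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal axes of the centred data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_dims].T  # axes have shape (p, n_dims)


def to_plane(X, mean, axes):
    """Project p-dimensional rows of X onto the biplot plane."""
    return (X - mean) @ axes


def from_plane(Z, mean, axes):
    """Map points on the biplot plane back to the original p dimensions."""
    return Z @ axes.T + mean


def black_box(X):
    """Hypothetical stand-in for a fitted model: one probability per row."""
    return 1.0 / (1.0 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # synthetic data with p = 5
mean, axes = fit_pca_plane(X)
Z = to_plane(X, mean, axes)            # sample points on the plane

# Regular grid covering the samples on the plane.
g1 = np.linspace(Z[:, 0].min(), Z[:, 0].max(), 25)
g2 = np.linspace(Z[:, 1].min(), Z[:, 1].max(), 25)
grid = np.array([(a, b) for a in g1 for b in g2])

X_grid = from_plane(grid, mean, axes)  # grid points in the original space
colour = black_box(X_grid)             # one value per grid point, for colour coding
```

Plotting `colour` over `grid` (for example as a filled contour) together with the projected samples `Z` would give a coloured plane of the kind the abstract describes. Because the principal axes are orthonormal, projecting `X_grid` back onto the plane recovers `grid` exactly.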