Browsing by Subject "Mathematical Statistics"
Now showing 1 - 20 of 37
- Item (Open Access): The address sort and other computer sorting techniques (1971). Underhill, Leslie G; Troskie, Casper G. Originally this project was to have been a feasibility study of the use of computers in the library. It soon became clear that the logical place in the library at which to start making use of the computer was the catalogue. Once the catalogue was in machine-readable form it would be possible to work backwards to the book ordering and acquisitions system and forwards to the circulation and book issue system. One of the big advantages in using the computer to produce the catalogue would be the elimination of the "skilled drudgery" of filing. Thus vast quantities of data would need to be sorted. And thus the scope of this project was narrowed down from a general feasibility study, firstly to a study of a particular section of the library and secondly to one particularly important aspect of that section - that of sorting with the aid of the computer. I have examined many, but by no means all, computer sorting techniques, programmed them in FORTRAN as efficiently as I was able, and compared their performances on the IBM 1130 computer of the University of Cape Town. I have confined myself to internal sorts, i.e. sorts that take place in core. This thesis stops short of applying the best of these techniques to the library. I intend however to do so, and to work back to the original scope of my thesis.
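As a rough illustration of the family of techniques the thesis surveys, here is a minimal Python sketch of an address-calculation (distribution) sort; the slot-estimation formula, table-growth policy and toy keys are illustrative assumptions, not the FORTRAN routines benchmarked on the IBM 1130.

```python
def address_sort(keys, table_factor=2):
    """Address-calculation sort sketch: map each key to an estimated slot in an
    oversized table, resolve collisions by shifting right, then compact.
    Assumes roughly uniformly distributed numeric keys (illustrative only)."""
    if not keys:
        return []
    lo, hi = min(keys), max(keys)
    size = max(1, table_factor * len(keys))
    table = [None] * size
    span = (hi - lo) or 1
    for k in keys:
        # Estimated final position ("address") of the key.
        pos = int((k - lo) / span * (size - 1))
        # Walk right past entries that should precede k.
        while table[pos] is not None and table[pos] <= k:
            pos += 1
            if pos == size:            # ran off the end: grow the table
                table.append(None)
                size += 1
        # Insert k, shifting any larger entries one slot to the right.
        while table[pos] is not None:
            table[pos], k = k, table[pos]
            pos += 1
            if pos == size:
                table.append(None)
                size += 1
        table[pos] = k
    return [x for x in table if x is not None]

print(address_sort([27, 3, 14, 9, 31, 14, 1]))   # [1, 3, 9, 14, 14, 27, 31]
```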
- Item (Open Access): Analysis of clustered competing risks with application to a multicentre clinical trial (2016). Familusi, Mary Ajibola; Gumedze, Freedom N. The usefulness of time-to-event (survival) analysis has made it gain wide applicability in statistical modelling research. The methodological developments of time-to-event analysis that have been widely adopted are: (i) the Kaplan-Meier method, for estimating the survival function; (ii) the log-rank test, for comparing the equality of two or more survival distributions; (iii) the Cox proportional hazards model, for examining covariate effects on the hazard function; and (iv) the accelerated failure time model, for examining covariate effects on the survival function. Nonetheless, in the assessment of time-to-event endpoints, if subjects can fail from multiple mutually exclusive causes, the data are said to have competing risks. For competing risks data, the Fine and Gray proportional hazards model for sub-distributions has gained popularity due to its convenience in directly assessing the effect of covariates on the cumulative incidence function. Furthermore, competing risks data sometimes cannot be considered independent because of a clustered design, for instance in registry cohorts or multi-centre clinical trials. The Fine and Gray model has been extended to the analysis of clustered time-to-event data by including random-centre effects or frailties in the sub-distribution hazard. This research focuses on the analysis of clustered competing risks with an application to the Investigation of the Management of Pericarditis (IMPI) clinical trial dataset. IMPI is a multi-centre clinical trial that was carried out across 19 centres in 8 African countries with the principal objective of assessing the effectiveness and safety of adjunctive prednisolone and Mycobacterium indicus pranii immunotherapy in reducing the composite outcome of death, constriction or cardiac tamponade requiring pericardial drainage, in patients with probable or definite tuberculous pericarditis. The clinical objective in this thesis is therefore to analyse time to these outcomes. In addition, the risk factors associated with these outcomes were determined, and the effect of prednisolone and M. indicus pranii was examined, while adjusting for these risk factors and treating centre as a random effect. Using the Cox proportional hazards model, it was found that age, weight, New York Heart Association (NYHA) class, hypotension, creatinine, and peripheral oedema show a statistically significant association with the composite outcome. Furthermore, weight, NYHA class, hypotension, creatinine and peripheral oedema show a statistically significant association with death. In addition, NYHA class and hypotension show a statistically significant association with cardiac tamponade. Lastly, prednisolone, gender, NYHA class, tachycardia, haemoglobin level, peripheral oedema, pulmonary infiltrate and HIV status show a statistically significant association with constriction. A significance level of 0.1 was used to identify variables as significant in the univariate model using a forward stepwise regression method. The random effect was found to be significant in the incidence of the composite outcome of death, cardiac tamponade and constriction, and in the individual outcome of constriction, but this only slightly changed the estimated effects of the covariates compared to when the random effect was not considered. Accounting for death as a competing event to the outcomes of cardiac tamponade or constriction does not affect the effect of the covariates on these outcomes. In addition, in the multivariate models that adjust for other risk factors, there was no significant difference in the primary outcome between patients who received prednisolone and those who received placebo, or between those who received M. indicus pranii immunotherapy and those who received placebo.
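To make the competing-risks setting concrete, the sketch below computes a nonparametric cumulative incidence function for one cause in the presence of a competing cause; the function name, toy times and cause codes are invented for illustration, and this is not the Fine and Gray sub-distribution model fitted to the IMPI data.

```python
import numpy as np

def cumulative_incidence(time, event, cause=1):
    """Nonparametric cumulative incidence for `cause` with competing events.
    `event` is 0 for censored observations, otherwise the cause code."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event)
    order = np.argsort(time)
    time, event = time[order], event[order]

    at_risk = len(time)
    surv = 1.0            # all-cause survival just before the current time
    cif = 0.0
    times, cifs = [], []
    for t in np.unique(time):
        d_all = np.sum((time == t) & (event != 0))       # failures from any cause at t
        d_cause = np.sum((time == t) & (event == cause))
        if at_risk > 0:
            cif += surv * d_cause / at_risk              # increment the CIF
            surv *= 1.0 - d_all / at_risk                # update all-cause survival
        at_risk -= np.sum(time == t)                     # drop failures and censorings at t
        times.append(t)
        cifs.append(cif)
    return np.array(times), np.array(cifs)

# Toy data: cause 1 = constriction/tamponade, cause 2 = death, 0 = censored.
t = [2, 3, 3, 5, 7, 8, 10, 12]
e = [1, 2, 0, 1, 2, 0, 1, 0]
print(cumulative_incidence(t, e, cause=1))
```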
- Item (Open Access): Analysis of distribution maps from bird atlas data: dissimilarities between species, continuity within ranges and smoothing of distribution maps (1998). Erni, Birgit; Underhill, Les. A dissimilarity coefficient for estimating the dissimilarity between two bird atlas distributions is developed. This coefficient is based on the Euclidean distance concept. The atlas distributions are compared over all quarter degree grid cells. Existing coefficients are not suitable for the comparison of distributions with different total areas and species with different mean reporting rates. In each grid cell the reliability of the reporting rates depends on the number of checklists collected for the grid cell. Weights are used to solve this problem. To solve the problem of different levels of abundance and conspicuousness of species, the reporting rates are sorted into percentiles, using five or 10 categories for the strictly positive reporting rates. Each grid cell is weighted by a function of the number of checklists collected for the grid cell. The coefficient is scaled by the maximum possible sum of the differences which would occur if there is no overlap between the two distributions, so that the dissimilarity coefficient lies between zero (a perfect match) and one (no overlap). A variety of these coefficients are investigated and compared. The continuity of observed reporting rates in a spatial cellular map is an indication of spatial autocorrelation present, especially between observations which are in close vicinity. We are particularly interested in measuring and comparing the continuity of the reporting rates in the bird distributions from The Atlas of Southern African Birds. The variogram, developed in geostatistics, estimates this spatial autocorrelation. The classical variogram estimator, however, is dependent on the scale of measurement and assumes that the data are intrinsically stationary. The bird atlas distribution maps contain trend, and the variance of each observation (reporting rate) is a function of the number of checklists collected for the grid cell and the underlying probability of encountering the species in the grid cell. The approach of removing this binomial measurement error from the variogram developed by McNeill (1991) is investigated but not found satisfactory. A weighted variogram, where each squared difference is weighted by a function of the smaller number of checklists, is developed. To make the variogram values comparable between species a function of the mean reporting rates is used to scale the variogram. We were particularly interested in the first variogram value of each species distribution, 2γ(1). The bird distribution maps in The Atlas of Southern African Birds show the raw observed reporting rates. Each of these reporting rates is a random variable dependent on sampling error due to binomial variation based on the number of checklists collected for the grid cell and on the underlying probability of encountering the species. The distribution maps show this measurement error. It is believed that a smoothed version of the bird distribution maps will to some extent improve the statement these observed distributions are aiming to make. Single-step regression methods are investigated for a fast approach to this problem. These cause problems because of frequent 'zero' observed reporting rates and because they smooth the maps too heavily. Generalized Linear Models are investigated, and this iterative procedure is applied to model the reporting rates with a binomial distribution on square blocks of nine grid cells, where a value for the central cell is 'predicted' in each regression. This approach is especially suited to accommodate the binomial distribution characteristics and is found to smooth the bird atlas distributions well. Because only a local window is taken for each regression, the spatial autocorrelation is adequately included in the spatial explanatory variables.
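For orientation, a small Python sketch of the classical (Matheron) variogram estimator on a regular grid of reporting rates follows; the checklist-based weighting and scaling refinements developed in the thesis are not reproduced, and the grid values and lags are purely illustrative.

```python
import numpy as np

def empirical_variogram(grid, max_lag=5):
    """Classical variogram estimator 2*gamma(h) on a regular grid,
    using horizontal and vertical neighbour pairs at integer lags.
    NaN cells (e.g. grid cells with no checklists) are ignored."""
    grid = np.asarray(grid, dtype=float)
    gammas = []
    for h in range(1, max_lag + 1):
        diffs = []
        a, b = grid[:, :-h], grid[:, h:]          # pairs h columns apart
        diffs.append((a - b)[~np.isnan(a - b)])
        a, b = grid[:-h, :], grid[h:, :]          # pairs h rows apart
        diffs.append((a - b)[~np.isnan(a - b)])
        d = np.concatenate(diffs)
        gammas.append(np.mean(d ** 2))            # 2*gamma(h)
    return np.arange(1, max_lag + 1), np.array(gammas)

# Toy 6x6 grid of observed reporting rates in [0, 1].
rng = np.random.default_rng(1)
rates = rng.uniform(0, 1, size=(6, 6))
print(empirical_variogram(rates, max_lag=3))
```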
- Item (Open Access): Aspects of multivariate complex quadratic forms (1981). Conradie, Willem Jacobus; Troskie, Casper G. In this study the distributional properties of certain multivariate complex quadratic forms and their characteristic roots are investigated. Multivariate complex distribution theory was originally introduced by Wooding (1956), Turin (1960) and Goodman (1963a) when they derived and studied the multivariate complex normal distribution. The multivariate complex normal distribution is the basis of complex distribution theory and plays an important role in various areas. In the area of multiple time-series the complex distribution theory is found very useful.
- Item (Open Access): Aspects of non-central multivariate t distributions (1973). Juritz, June M; Troskie, Casper G
- Item (Open Access): A comparison of methods for analysing interval-censored and truncated survival data (2004). Davidse, Alistair; Juritz, June. This thesis examines three methods for analysing right-censored data: the Cox proportional hazards model (Cox, 1972), the Buckley-James regression model (Buckley and James, 1979) and the accelerated failure time model. These models are extended to incorporate the analysis of interval-censored and left-truncated data. The models are compared in an attempt to determine whether one model performs better than the others in terms of goodness-of-fit and in terms of predictive power. Plots of the residuals and random effects from the Cox proportional hazards model are also examined.
- Item (Open Access): A contribution to adaptive robust estimation (1981). Barr, Graham Douglas Irving; Money, A H; Affleck-Graves, J F; Hart, M L. This study initially set out to consider the possibility of constructing an adaptive robust estimation procedure for the standard linear regression model when the disturbance vector deviated from normality; however, after the initial success in that field it seemed only appropriate that the approach be extended to robust location parameter estimation. This is a particular case of the regression model and an area in which a number of different estimators have been proposed and a great deal of comparative research work done. Due to the wider scope of such research, the greater part of the thesis is devoted to this field of research, which led to many interesting and useful results and conclusions.
- Item (Open Access): A contribution to the solving of non-linear estimation problems (1984). Gonin, René; Money, A H
- Item (Open Access): Contributions to the theory of generalized inverses, the linear model and outliers (1982). Dunne, Timothy Terence; Troskie, Casper G. Column-space conditions are shown to be at the heart of a number of identities linking generalized inverses of rectangular matrices. These identities give some new insights into reparametrizations of the general linear model, and into the imposition of constraints, when the variance-covariance structure is σ²I. Hypothesis-test statistics for non-estimable functions are shown to give no further information than underlying estimable functions. For an arbitrary variance-covariance structure the "sweep-out" method is generalized. The John and Draper model for outliers is extended, and distributional results established. Some diagnostic statistics for outlying or influential observations are considered. A Bayesian formulation of outliers in the general linear model is attempted.
- Item (Open Access): Discriminant analysis: a review of its application to the classification of grape cultivars (1989). Blignaut, Rennette Julia; Zucchini, Walter; Stewart, Theodor J. The aim of this study was to calculate a classification function for discriminating between five grape cultivars, with a view to determining the cultivar of an unknown grape juice. In order to discriminate between the five grape cultivars various multivariate statistical techniques, such as principal component analysis, cluster analysis, correspondence analysis and discriminant analysis, were applied. Discriminant analysis proved to be the most appropriate technique for the problem at hand and therefore an in-depth study of this technique was undertaken. Discriminant analysis was the most appropriate technique for classifying these grape samples into distinct cultivars because this technique utilized prior information of population membership. This thesis is divided into two main sections. The first section (chapters 1 to 5) is a review of discriminant analysis, describing various aspects of this technique and matters related thereto. In the second section (chapter 6) the theories discussed in the first section are applied to the problem at hand. The results obtained when discriminating between the different grape cultivars are given. Chapter 1 gives a general introduction to the subject of discriminant analysis, including certain basic derivations used in this study. Two approaches to discriminant analysis are discussed in Chapter 2, namely the parametric and non-parametric approaches. In this review the emphasis is placed on the classical approach to discriminant analysis. Non-parametric approaches such as the K-nearest neighbour technique, the kernel method and ranking are briefly discussed. Chapter 3 deals with estimating the probability of misclassification. In Chapter 4 variable selection techniques are discussed. Chapter 5 briefly deals with sequential and logistic discrimination techniques. The estimation of missing values is also discussed in this chapter. A final summary and conclusion is given in Chapter 7. Appendices A to D illustrate some of the results obtained from the practical analyses.
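A minimal sketch of the parametric (linear) discriminant rule that the review centres on, using scikit-learn; the feature matrix, cultivar labels and train/test split are placeholders rather than the grape-juice measurements analysed in the thesis.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Placeholder data: 150 juice samples, 8 chemical measurements, 5 cultivars.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
y = rng.integers(0, 5, size=150)             # cultivar codes 0..4

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lda = LinearDiscriminantAnalysis()            # pooled covariance, normal-theory rule
lda.fit(X_tr, y_tr)

# Apparent and hold-out error rates (cf. the misclassification estimates of Chapter 3).
print("training error:", 1 - lda.score(X_tr, y_tr))
print("hold-out error:", 1 - lda.score(X_te, y_te))
print("posterior probabilities for first test sample:", lda.predict_proba(X_te[:1]).round(3))
```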
- Item (Open Access): Distributions of certain test statistics in multivariate regression (1980). Coutsourides, Dimitris; Troskie, Casper G. This thesis is principally concerned with test criteria for testing different hypotheses for the multivariate regression. In this preface a brief summary of each of the succeeding chapters is given. In Chapter 1 the problem of testing the equality of two population multiple correlation coefficients in identical regression experiments has been studied. The author's results are extensions to those of Schuman and Bradley. In Chapter 2 the results of Chapter 1 are extended to the multivariate case; in other words, the author has constructed tests in order to test the equality of two population generalized multiple correlation matrices. In Chapter 3 the author shows that the Ridge Regression, Principal Components and Shrunken estimators yield the same central t and F statistics as the ordinary least squares estimator. In Chapter 4, using the results of Aitken, simultaneous tests for the Cp-criterion of Mallows are constructed. Some comments on extrapolation and prediction are made. In Chapter 5 the Ridge and Principal Components residuals are studied. Their use for detecting outliers, when multicollinearity is present, is examined.
- Item (Open Access): Empirical statistical modelling for crop yields predictions: Bayesian and uncertainty approaches (2015). Adeyemi, Rasheed Alani; Guo, Renkuan; Dunne, Tim. This thesis explores uncertainty statistics to model agricultural crop yields in a situation where there are neither sampling observations nor historical records. The Bayesian approach to a linear regression model is useful for prediction of crop yield when there are data quantity issues and model structure uncertainty, and when the regression model involves a large number of explanatory variables. Data quantity issues might occur when a farmer is cultivating a new crop variety, moving to a new farming location or introducing a new farming technology, where the situation may warrant a change in the current farming practice. The first part of this thesis involved the collection of data from the experts' domain and the elicitation of probability distributions. Uncertainty statistics, the foundations of uncertainty theory and the data gathering procedures were discussed in detail. We proposed an estimation procedure for the estimation of uncertainty distributions. The procedure was then implemented on agricultural data to fit uncertainty distributions to five cereal crop yields. A Delphi method was introduced and used to fit uncertainty distributions for multiple experts' data on sesame seed yield. The thesis defined an uncertainty distance and derived a distance for the difference between two uncertainty distributions. We lastly estimated the distance between a hypothesized distribution and an uncertainty normal distribution. Although the applicability of uncertainty statistics is limited to one-sample models, the approach provides a fast way to establish a standard for process parameters. Where no sampling observation exists, or it is very expensive to acquire, the approach provides an opportunity to engage experts and come up with a model for guiding decision making. In the second part, we fitted a full dataset obtained from an agricultural survey of small-scale farmers to a linear regression model using direct Markov Chain Monte Carlo (MCMC), Bayesian estimation (with a uniform prior) and the maximum likelihood estimation (MLE) method. The results obtained from the three procedures yielded similar mean estimates, but the credible intervals were found to be narrower in the Bayesian estimates than the confidence intervals in the MLE method. The predictive outcome of the estimated model was then assessed using simulated data for a set of covariates. Furthermore, the dataset was randomly split into two datasets. An informative prior was estimated from one half, called the "old data", using the Ordinary Least Squares (OLS) method. Three models were then fitted to the second half, called the "new data": a General Linear Model (GLM) (M1), a Bayesian model with a non-informative prior (M2) and a Bayesian model with an informative prior (M3). A leave-one-out cross-validation (LOOCV) method was used to compare the predictive performance of these models. It was found that the Bayesian models showed better predictive performance than M1. M3 (with the expert prior) had moderate average cross-validation (CV) error and CV standard error. The GLM performed worst, with the least average CV error and the highest CV standard error among the models. In model M3 (expert prior), the predictor variables were found to be significant at 95% credible intervals. In contrast, most variables were not significant under models M1 and M2. Also, the model with the informative prior had narrower credible intervals compared to the non-informative prior and GLM models. The results indicated that variability and uncertainty in the data were reasonably reduced due to the incorporation of the expert (informative) prior. We lastly investigated the residual plots of these models to assess their prediction performance. Bayesian Model Averaging (BMA) was later introduced to address the issue of model structure uncertainty of a single model. BMA allows the computation of a weighted average over possible model combinations of predictors. An approximate AIC weight was then proposed for model selection instead of frequentist hypothesis testing (or model comparison in a set of competing candidate models). The method is flexible and easier to interpret than raw AIC or the Bayesian information criterion (BIC), which approximates the Bayes factor. Zellner's g-prior was considered appropriate as it has been widely used in linear models. It preserves the correlation structure among predictors in its prior covariance. The method also yields closed-form marginal likelihoods, which lead to substantial computational savings by avoiding sampling in the parameter space as in BMA. We lastly determined a single optimal model from all possible combinations of models and also computed the log-likelihood of each model.
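The contrast between an informative and a vague prior can be sketched with a conjugate normal-prior linear regression; the simulated data, the assumed known noise variance and the "expert" prior values below are illustrative only, and the sketch does not reproduce the MCMC, LOOCV or BMA analyses of the thesis.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated "new data": yield regressed on two covariates.
n, beta_true, sigma = 40, np.array([2.0, 1.5, -0.8]), 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ beta_true + rng.normal(scale=sigma, size=n)

def posterior(X, y, mu0, V0, sigma2):
    """Posterior mean and covariance of beta under a N(mu0, V0) prior
    with known noise variance sigma2 (conjugate normal linear model)."""
    V0_inv = np.linalg.inv(V0)
    Vn = np.linalg.inv(V0_inv + X.T @ X / sigma2)
    mn = Vn @ (V0_inv @ mu0 + X.T @ y / sigma2)
    return mn, Vn

p = X.shape[1]
# Hypothetical "expert" informative prior versus a vague prior.
m_inf, V_inf = posterior(X, y, np.array([2.0, 1.0, -1.0]), 0.5 * np.eye(p), sigma**2)
m_vag, V_vag = posterior(X, y, np.zeros(p), 1e6 * np.eye(p), sigma**2)
ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("OLS estimate:        ", ols.round(3))
print("vague-prior mean:    ", m_vag.round(3))   # essentially reproduces OLS
print("informative mean:    ", m_inf.round(3))
print("informative 95% interval widths:", (2 * 1.96 * np.sqrt(np.diag(V_inf))).round(3))
print("vague 95% interval widths:      ", (2 * 1.96 * np.sqrt(np.diag(V_vag))).round(3))
```

The narrower interval widths under the informative prior mirror the pattern reported above for the M3 model.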
- Item (Open Access): Exact powers of some multivariate test criteria (1974). Hart, Michael Lester; Money, A H. In this thesis an algorithm for the noncentral linear density and cumulative distribution function of Wilks' likelihood ratio criterion in MANOVA is derived, and it is shown how this algorithm, with modifications, can be used to find the distributions of a number of test criteria for different hypotheses. At the same time previous results regarding percentiles and powers of these criteria are examined and discussed.
- Item (Open Access): An examination of heuristic algorithms for the travelling salesman problem (1988). Höck, Barbar Katja; Stewart, Theodor J. The role of heuristics in combinatorial optimization is discussed. Published heuristics for the Travelling Salesman Problem (TSP) were reviewed, and morphological boxes were used to develop new heuristics for the TSP. New and published heuristics were programmed for symmetric TSPs where the triangle inequality holds, and were tested on a microcomputer. The best of the quickest heuristics was the furthest insertion heuristic, finding tours 3 to 9% above the best known solutions (2 minutes for 100 nodes). Better results were found by longer-running heuristics, e.g. the cheapest angle heuristic (CCAO), 0-6% above best (80 minutes for 100 nodes). The savings heuristic found the best results overall, but took more than 2 hours to complete. Of the new heuristics, the MST path algorithm at times improved on the results of the furthest insertion heuristic while taking the same time as the CCAO. The study indicated that there is little likelihood of improving on present methods unless a fundamentally new approach is discovered. Finally, a case study using TSP heuristics to aid the planning of grid surveys is described.
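A compact Python sketch of the furthest insertion construction heuristic mentioned above, for a symmetric Euclidean TSP; the random cities are illustrative and no attempt is made at the run-time tuning reported in the thesis.

```python
import numpy as np

def furthest_insertion(points):
    """Furthest-insertion tour for a symmetric Euclidean TSP.
    Start from the two mutually furthest cities, then repeatedly pick the
    unvisited city furthest from the tour and insert it where it adds the
    least extra length."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    tour = [int(i), int(j)]
    remaining = set(range(n)) - set(tour)

    while remaining:
        # City whose minimum distance to the current tour is largest.
        rem = list(remaining)
        k = rem[int(np.argmax([dist[r, tour].min() for r in rem]))]
        # Cheapest insertion position for k.
        best_pos, best_inc = 0, np.inf
        for pos in range(len(tour)):
            a, b = tour[pos], tour[(pos + 1) % len(tour)]
            inc = dist[a, k] + dist[k, b] - dist[a, b]
            if inc < best_inc:
                best_pos, best_inc = pos + 1, inc
        tour.insert(best_pos, k)
        remaining.remove(k)

    length = sum(dist[tour[m], tour[(m + 1) % n]] for m in range(n))
    return tour, length

rng = np.random.default_rng(0)
tour, length = furthest_insertion(rng.uniform(0, 100, size=(30, 2)))
print(len(tour), "cities, tour length %.1f" % length)
```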
- Item (Open Access): High-frequency correlation dynamics: Is the Epps effect a bias? (2021). Chang, Patrick; Gebbie, Timothy; Pienaar, Etienne. We tackle the question of whether Trade and Quote data from high-frequency finance are representative of discrete connected events, or whether these measurements can still be faithfully represented as random samples of some underlying Brownian diffusion, in the context of modelling correlation dynamics. In particular, we ask whether the implicit notion of instantaneous correlation dynamics that are independent of the time-scale is a reasonable assumption. To this end, we apply kernel-averaging non-uniform fast Fourier transforms in the context of the Malliavin-Mancino integrated and instantaneous volatility estimators to speed up the estimators. We demonstrate the implicit time-scale investigated by the estimator by comparing it to the theoretical Epps effect arising from asynchrony. We compare the Malliavin-Mancino and Cuchiero-Teichmann Fourier instantaneous estimators and demonstrate the relationship between the instantaneous Epps effect and the cutting frequencies in the Fourier estimators. We find that using previous-tick interpolation in the Cuchiero-Teichmann estimator results in unstable estimates when dealing with asynchrony, while the ability to bypass the time domain with the Malliavin-Mancino estimator allows it to produce stable estimates, making it better suited for ultra high-frequency finance. We derive the Epps effect arising from asynchrony and provide a refined approach to correct the effect. We compare methods to correct for the Epps effect arising from asynchrony when the underlying process is a Brownian diffusion, and when the underlying process is composed of discrete connected events (proxied using a D-type Hawkes process). We design three experiments using the Epps effect to discriminate between the underlying processes. These experiments demonstrate that using a Hawkes representation recovers the empiricism reported in the literature under simulation conditions that cannot be achieved when using a Brownian representation. The experiments are applied to Trade and Quote data from the Johannesburg Stock Exchange, and the evidence suggests that the empirical measurements are from a system of discrete connected events where correlations are an emergent property of the time-scale rather than an instantaneous quantity that exists at all time-scales.
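To make the Epps effect concrete, the toy sketch below simulates two correlated Brownian prices observed asynchronously, applies previous-tick interpolation and shows the measured correlation decaying as the sampling interval shrinks; it is not the Malliavin-Mancino or Cuchiero-Teichmann Fourier estimator used in the thesis, and all parameters are invented.

```python
import numpy as np

rng = np.random.default_rng(7)

# Two correlated Brownian log-price paths on a fine grid (1 "second" steps).
T, rho, sigma = 40_000, 0.7, 0.01
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=T)
prices = np.cumsum(sigma * z, axis=0)

# Asynchronous observation times: each asset is seen only at its own random ticks.
obs = [np.sort(rng.choice(T, size=T // 20, replace=False)) for _ in range(2)]

def previous_tick(path, times, grid):
    """Sample a path at `grid` using the last observation at or before each point."""
    idx = np.searchsorted(times, grid, side="right") - 1
    idx = np.clip(idx, 0, None)
    return path[times[idx]]

for dt in (10, 60, 300, 600):            # sampling intervals in "seconds"
    grid = np.arange(dt, T, dt)
    r = [np.diff(previous_tick(prices[:, a], obs[a], grid)) for a in range(2)]
    corr = np.corrcoef(r[0], r[1])[0, 1]
    print(f"dt = {dt:4d}: measured correlation = {corr:.3f} (induced = {rho})")
# Correlations measured at small dt are biased towards zero: the Epps effect.
```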
- Item (Open Access): Identifying outliers and influential observations in general linear regression models (2004). Katshunga, Dominique; Troskie, Casper G. Identifying outliers and/or influential observations is a fundamental step in any statistical analysis, since their presence is likely to lead to erroneous results. Numerous measures have been proposed for detecting outliers and assessing the influence of observations on least squares regression results. Since outliers can arise in different ways, the above-mentioned measures are based on motivational arguments and are designed to measure the influence of observations on different aspects of various regression results. In what follows, we investigate how one can combine different test statistics based on residuals and diagnostic plots to identify outliers and influential observations (both in the single and multiple case) in general linear regression models.
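A short sketch of two standard residual-based diagnostics of the kind this dissertation combines, externally studentized residuals and Cook's distance, computed from the hat matrix on made-up data; the cut-offs shown are common rules of thumb rather than the decision rules adopted in the thesis.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(scale=0.5, size=n)
y[10] += 4.0                                   # plant one outlier

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
h = np.diag(H)                                 # leverages
e = y - H @ y                                  # ordinary residuals
s2 = e @ e / (n - p)                           # residual variance estimate

# Externally studentized residuals (leave-one-out variance).
s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
t_ext = e / np.sqrt(s2_i * (1 - h))

# Cook's distance: combined effect of residual size and leverage.
cooks = (e**2 / (p * s2)) * (h / (1 - h) ** 2)

flag = (np.abs(t_ext) > 2) | (cooks > 4 / n)   # common rules of thumb
print("flagged observations:", np.where(flag)[0])
```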
- Item (Open Access): Investigating 'optimal' kriging variance estimation: analytic and bootstrap estimators (2011). Ngwenya, Mzabalazo Z; Thiart, Christien; Haines, Linda. Kriging is a widely used group of techniques for predicting unobserved responses at specified locations using a set of observations obtained from known locations. Kriging predictors are best linear unbiased predictors (BLUPs), and the precision of predictions obtained from them is assessed by the mean squared prediction error (MSPE), commonly termed the kriging variance.
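For context, a bare-bones ordinary kriging predictor and its kriging variance, built from an assumed exponential semivariogram, is sketched below; the variogram parameters, locations and responses are illustrative, and the plug-in uncertainty questions the thesis investigates are not addressed.

```python
import numpy as np

def exp_variogram(h, nugget=0.1, sill=1.0, rang=5.0):
    """Assumed exponential semivariogram with illustrative parameters."""
    return nugget + (sill - nugget) * (1 - np.exp(-h / rang))

def ordinary_kriging(locs, z, x0):
    """Ordinary kriging prediction and kriging variance at location x0."""
    locs, z, x0 = map(np.asarray, (locs, z, x0))
    n = len(z)
    H = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
    G = exp_variogram(H)
    np.fill_diagonal(G, 0.0)                     # gamma(0) = 0 by convention
    g0 = exp_variogram(np.linalg.norm(locs - x0, axis=1))

    # Kriging system with a Lagrange multiplier for the unbiasedness constraint.
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = G
    A[:n, n] = A[n, :n] = 1.0
    b = np.append(g0, 1.0)
    sol = np.linalg.solve(A, b)
    lam, mu = sol[:n], sol[n]

    pred = lam @ z
    krig_var = lam @ g0 + mu                     # mean squared prediction error
    return pred, krig_var

locs = [(0, 0), (1, 3), (4, 1), (3, 4), (6, 2)]
z = [2.1, 2.6, 1.8, 3.0, 1.5]
print(ordinary_kriging(locs, z, x0=(2.5, 2.5)))
```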
- Item (Open Access): An investigation into Functional Linear Regression Modeling (2015). Essomba, Rene Franck; Lubbe, Sugnet. Functional data analysis, commonly known as "FDA", refers to the analysis of information on curves or functions. Key aspects of FDA include the choice of smoothing techniques, data reduction, model evaluation, functional linear modelling and forecasting methods. FDA is applicable in numerous fields such as Bioscience, Geology, Psychology, Sports Science, Econometrics, Meteorology, etc. The main objective of this dissertation is to focus more specifically on Functional Linear Regression Modelling (FLRM), which is an extension of Multivariate Linear Regression Modelling. The problem of constructing a Functional Linear Regression Model with functional predictors and a functional response variable is considered in great detail. Discretely observed data for each variable involved in the modelling are expressed as smooth functions using a Fourier basis, a B-spline basis and a Gaussian basis. The Functional Linear Regression Model is estimated by the least squares method, the maximum likelihood method and, more thoroughly, by the penalized maximum likelihood method. A central issue when modelling functional regression models is the choice of a suitable model criterion as well as the number of basis functions and an appropriate smoothing parameter. Four different types of model criteria are reviewed: the Generalized Cross-Validation, the Generalized Information Criterion, the modified Akaike Information Criterion and the Generalized Bayesian Information Criterion. Each of these methods is applied to a dataset and contrasted based on their respective results.
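The first step described above, expressing discretely observed curves as smooth functions in a Fourier basis, can be sketched as an ordinary least-squares fit of basis coefficients; the number of basis functions and the simulated curve below are arbitrary illustrative choices, not ones selected by the model criteria reviewed in the dissertation.

```python
import numpy as np

def fourier_basis(t, n_basis, period=1.0):
    """Fourier basis evaluated at time points t: [1, sin, cos, sin, cos, ...]."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    k = 1
    while len(cols) < n_basis:
        w = 2 * np.pi * k / period
        cols.append(np.sin(w * t))
        if len(cols) < n_basis:
            cols.append(np.cos(w * t))
        k += 1
    return np.column_stack(cols)

# Noisy discrete observations of one curve on [0, 1].
rng = np.random.default_rng(5)
t = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * t) + 0.3 * np.cos(6 * np.pi * t) + rng.normal(scale=0.2, size=t.size)

Phi = fourier_basis(t, n_basis=7)                     # 7 basis functions (illustrative)
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # least-squares basis coefficients
smooth = Phi @ coef                                   # smoothed functional representation
print("coefficients:", coef.round(3))
print("residual SD:", np.std(y - smooth).round(3))
```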
- Item (Open Access): Line transect abundance estimation with uncertain detection on the trackline (1996). Borchers, D L. After critically reviewing developments in line transect estimation theory to date, general likelihood functions are derived for the case in which detection probabilities are modelled as functions of any number of explanatory variables and detection of animals on the trackline (i.e. directly in the observer's path) is not certain. Existing models are shown to correspond to special cases of the general models. Maximum likelihood estimators are derived for some special cases of the general model, and some existing line transect estimators are shown to correspond to maximum likelihood estimators for other special cases. The likelihoods are shown to be extensions of existing mark-recapture likelihoods as well as being generalizations of existing line transect likelihoods. Two new abundance estimators are developed. The first is a Horvitz-Thompson-like estimator which utilizes the fact that, for point estimation of abundance, the density of perpendicular distances in the population can be treated as known in appropriately designed line transect surveys. The second is based on modelling the probability density function of detection probabilities in the population. Existing line transect estimators are shown to correspond to special cases of the new Horvitz-Thompson-like estimator, so that this estimator, together with the general likelihoods, provides a unifying framework for estimating abundance from line transect surveys.
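A stripped-down sketch of a Horvitz-Thompson-style line transect abundance estimator follows: each detection is weighted by the inverse of its detection probability under an assumed half-normal detection function with certain detection on the trackline (g(0) = 1), which is exactly the assumption the thesis relaxes; the distances, parameter values and survey dimensions are invented.

```python
import numpy as np

def half_normal_p(x, sigma):
    """Assumed half-normal detection function g(x) = exp(-x^2 / (2 sigma^2))."""
    return np.exp(-x**2 / (2 * sigma**2))

def ht_abundance(distances, sigma, truncation, line_length, study_area):
    """Horvitz-Thompson-like abundance estimate: animals in a strip of half-width
    `truncation` around transects of total length `line_length`, scaled up to the
    whole study area."""
    x = np.asarray(distances, dtype=float)
    # Each detected animal at perpendicular distance x_i is weighted by
    # 1 / g(x_i), its detection probability (with g(0) = 1 assumed).
    n_strip = np.sum(1.0 / half_normal_p(x, sigma))    # estimated animals in the strip
    covered_area = 2 * truncation * line_length
    density = n_strip / covered_area
    return density * study_area

# Invented perpendicular detection distances (same units as the lengths below).
dists = np.array([2, 5, 1, 8, 12, 3, 0.5, 7, 15, 4])
print(ht_abundance(dists, sigma=8.0, truncation=20.0,
                   line_length=1000.0, study_area=5e5))
```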
- Item (Open Access): Loss distributions in consumer credit risk: macroeconomic models for expected and unexpected loss (2016). Malwandla, Musa; Rajaratnam, Kanshukan; Clark, Allan. This thesis focuses on modelling the distributions of loss in consumer credit arrangements, both at an individual level and at a portfolio level, and how these might be influenced by loan-specific factors and economic factors. The thesis primarily aims to examine how these factors can be incorporated into a credit risk model through logistic regression models and threshold regression models. Considering the fact that the specification of a credit risk model is influenced by its purpose, the thesis considers the IFRS 7 and IFRS 9 accounting requirements for impairment disclosure as well as the Basel II regulatory prescriptions for capital requirements. The thesis presents a critique of the unexpected loss calculation under Basel II by considering the different ways in which loans can correlate within a portfolio. Two distributions of portfolio losses are derived. The Vašíček distribution, which is the one assumed in the Basel II requirements, was originally derived for corporate loans and was never adapted for application in consumer credit. This makes it difficult to interpret and validate the correlation parameters prescribed under Basel II. The thesis re-derives the Vašíček distribution under a threshold regression model that is specific to consumer credit risk, thus providing a way to estimate the model parameters from observed experience. The thesis also discusses how, if the probability of default is modelled through logistic regression, the portfolio loss distribution can be modelled as a log-log-normal distribution.
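For reference, the Vašíček limiting loss distribution discussed above has a closed-form loss-rate quantile; the sketch below evaluates it for illustrative PD, LGD, correlation and confidence values (not parameters estimated in the thesis) and uses it to separate expected from unexpected loss in the Basel II style.

```python
from scipy.stats import norm

def vasicek_loss_quantile(pd, rho, alpha):
    """Quantile of the Vasicek limiting loss-rate distribution: the default rate
    not exceeded with probability alpha in an asymptotically fine-grained
    portfolio with asset correlation rho."""
    return norm.cdf((norm.ppf(pd) + rho**0.5 * norm.ppf(alpha)) / (1 - rho) ** 0.5)

# Illustrative consumer-credit style inputs (all assumed values).
pd, lgd, ead = 0.03, 0.45, 1_000_000     # probability of default, loss given default, exposure
rho, alpha = 0.04, 0.999                 # asset correlation and confidence level

expected_loss = pd * lgd * ead
stressed_rate = vasicek_loss_quantile(pd, rho, alpha)
unexpected_loss = (stressed_rate - pd) * lgd * ead    # capital-style unexpected loss

print(f"stressed default rate: {stressed_rate:.4f}")
print(f"expected loss:   {expected_loss:,.0f}")
print(f"unexpected loss: {unexpected_loss:,.0f}")
```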