Browsing by Author "Varughese, Melvin"
Now showing 1 - 6 of 6
- Item. Open Access. Classification and visualisation of text documents using networks (2018). Phaweni, Thembani; Durbach, Ian; Varughese, Melvin; Bassett, Bruce. Graph/network theoretic methods can be applied effectively in both text classification and text visualisation. For text classification we assessed the effectiveness of graph/network summary statistics to develop weighting schemes and features to improve test accuracy. For text visualisation we developed a framework using established visual cues from the graph visualisation literature to communicate information intuitively. The final output of the visualisation component of the dissertation was a tool that would allow members of the public to produce a visualisation from a text document. We represented a text document as a graph/network. The words were nodes and the edges were created when a pair of words appeared within a pre-specified distance (window) of words from each other. The text document model is a matrix representation of a document collection such that it can be integrated into a machine or statistical learning algorithm. The entries of this matrix can be weighted according to various schemes. We used the graph/network representation of a text document to create features and weighting schemes that could be applied to the text document model. This approach was not well developed for text classification; therefore we applied different edge weighting methods, window sizes, weighting schemes and features. We also applied three machine learning algorithms: naïve Bayes, neural networks and support vector machines. We compared our various graph/network approaches to the traditional document model with term frequency inverse-document-frequency. We were interested in establishing whether or not the use of graph weighting schemes and graph features could increase test accuracy for text classification tasks.
As far as we can tell from the literature, this is the first attempt to use graph features to weight bag-of-words features for text classification. These methods had been applied to information retrieval (Blanco & Lioma, 2012). It seemed they could also be applied to text classification. The text visualisation field seemed divorced from the text summarisation and information retrieval fields, in that text co-occurrence relationships were not treated with equal importance. Developments in the graph/network visualisation literature could be taken advantage of for the purposes of text visualisation. We created a framework for text visualisation using the graph/network representation of a text document. We used force directed algorithms to visualise the document. We used established visual cues like colour, size and proximity in space to convey information through the visualisation. We also applied clustering and part-of-speech tagging to allow for filtering and isolating of specific information within the visualised document. We demonstrated this framework with four example texts. We found that total degree, a graph weighting scheme, outperformed term frequency on average. The effect of graph features depended heavily on the machine learning method used: for the problems we considered, graph features increased accuracy for SVM classifiers, had little effect for neural networks and decreased accuracy for naïve Bayes classifiers. Therefore the impact on test accuracy of adding graph features to the document model is dependent on the machine learning algorithm used. The visualisation of text graphs is able to convey meaningful information regarding the text at a glance through established visual cues. Related words are close together in visual space and often connected by thick edges. Large nodes often represent important words. Modularity clustering is able to extract thematically consistent clusters from text graphs.
This allows for the clusters to be isolated and investigated individually to understand specific themes within a document. The use of part-of-speech tagging is effective both in reducing the number of words being displayed and in increasing the relevance of the words being displayed. This was made clear through the use of part-of-speech tags applied to the Internal Resistance of Apartheid Wikipedia webpage. The webpage was reduced to its proper nouns, which contained much of the important information in the text. Training accuracy is important in text classification, which is a task that can often be performed on vast numbers of documents. Much of the research in text classification is aimed at increasing classification accuracy either through feature engineering or through optimising machine learning methods. The finding that total degree outperformed term frequency on average provides an alternative avenue for achieving higher test accuracy. The finding that the addition of graph features can increase test accuracy when matched with the right machine learning algorithm suggests some new research should be conducted regarding the role that graph features can have in text classification. Text visualisation is used as an exploratory tool and as a means of quickly and easily conveying text information. The framework we developed is able to create automated text visualisations that intuitively convey information for short and long text documents. This can greatly reduce the amount of time it takes to assess the content of a document, which can increase general access to information.
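The graph construction described in this abstract (words as nodes, an edge whenever two words fall within a fixed window of each other) can be illustrated with a minimal sketch. The function names, the undirected edge counting and the use of weighted degree as a stand-in for the "total degree" weighting are assumptions made for illustration; the abstract does not specify the dissertation's exact definitions.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Build an undirected co-occurrence graph from a token list.

    Edge weight = number of times two distinct words appear within
    `window` positions of each other (hypothetical weighting choice).
    """
    edges = defaultdict(int)
    for i, word in enumerate(tokens):
        # Look ahead at most `window` positions from the current word.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:  # skip self-loops
                edges[frozenset((word, other))] += 1
    return dict(edges)


def weighted_degree(edges):
    """Sum of incident edge weights per word.

    Used here as an illustrative stand-in for a degree-based term weight.
    """
    degree = defaultdict(int)
    for pair, weight in edges.items():
        for word in pair:
            degree[word] += weight
    return dict(degree)


tokens = "the cat sat on the mat".split()
edges = cooccurrence_graph(tokens, window=2)
degree = weighted_degree(edges)
```

In a document-term matrix, such a degree value could replace the raw term frequency for each word; whether that helps would depend on the classifier, as the abstract reports.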
- Item. Open Access. Flamingo foraging plasticity: ecological drivers and impacts (2017). Gihwala, Kirti Narendra; Pillay, Deena; Varughese, Melvin. The consequences of predation have become a central focus of marine ecological research. Numerous studies have emphasized the importance of apex predators in structuring assemblages at various organisational levels and in determining how ecosystems function. However, less appreciated currently is the fact that predators display multiple foraging behaviours, thereby allowing them to overcome problems associated with unpredictability of food resources in space and time. The primary goal of this dissertation is to contribute to the growing understanding of the ecological causes and consequences of foraging plasticity displayed by Greater Flamingo Phoenicopterus ruber roseus in intertidal sandflat ecosystems in Langebaan Lagoon, South Africa. P. roseus feeds by either (1) creating pits, which involves flamingos stirring up deep sediments with their feet or (2) creating channels, in which their inverted bills are swept from side-to-side on the sediment surface. The first objective of the study was to quantify the ecological drivers of decisions made by flamingos to feed, and to implement either pit- or channel-foraging strategies. The latter was achieved through RandomForest modelling techniques that identified the prominent ecological drivers from a suite of biotic and abiotic variables. Results indicate that biotic variables, i.e. those associated with flamingo prey assemblages, were key in driving choices made by flamingos to forage and to implement either pit- or channel-foraging strategies. The second aim of this dissertation was to quantify the repercussions of the two different foraging behaviours on benthic assemblages.
Comparisons of benthic assemblages in flamingo foraging structures (pits and channels) with adjacent non-foraged sediments (controls) indicated differential effects of both flamingo foraging methods on benthic communities, with channel-foraging eliciting a greater negative impact compared to pit-foraging, for which impacts were negligible. Abundance of macrofauna and surface-dwelling taxa such as micro-algae and the amphipod Urothoe grimaldii were all negatively impacted by channel-foraging. Sizes of channels constructed by flamingos were inversely related to their impacts, with impacts on macrofaunal abundance being greater in smaller channels. Overall, this study has shed light on the differential effects of foraging plasticity on prey assemblages and its importance in enhancing spatio-temporal heterogeneity in intertidal sandflats. The study also emphasizes the need to incorporate foraging plasticity into current thinking and conceptual models of predation in marine soft sediments, in order to appreciate the full spectrum of predation effects on assemblages.
- Item. Open Access. Non-Linear diffusion processes and applications (2016). Pienaar, Etienne A D; Varughese, Melvin. Diffusion models are useful tools for quantifying the dynamics of continuously evolving processes. Using diffusion models it is possible to formulate compact descriptions for the dynamics of real-world processes in terms of stochastic differential equations. Despite the flexibility of these models, they can often be extremely difficult to work with. This is especially true for non-linear and/or time-inhomogeneous diffusion models where even basic statistical properties of the process can be elusive. As such, we explore various techniques for analysing non-linear diffusion models in contexts ranging from conducting inference under discrete observation and solving first passage time problems, to the analysis of jump diffusion processes and highly non-linear diffusion processes. We apply the methodology to a number of real-world ecological and financial problems of interest and demonstrate how non-linear diffusion models can be used to better understand such phenomena. In conjunction with the methodology, we develop a series of software packages that can be used to accurately and efficiently analyse various classes of non-linear diffusion models.
- Item. Open Access. Parameter estimation of a bivariate diffusion process: the Heston model (2011). Nomoyi, Siyabulela; Varughese, Melvin. The main objective of the research is to estimate the parameters of the Heston (1993) model, which models the movement of asset prices assuming that the asset price volatility is stochastic. The paper concentrates on estimating these parameters by approximating the transitional probabilities of the diffusion process with a saddlepoint distribution. By solving a system of ordinary differential equations that are in terms of the system’s cumulants, and using these solutions to calculate the saddlepoint, the transitional probabilities of the diffusion process can be approximated.
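For orientation, the Heston (1993) model is commonly written as the following pair of stochastic differential equations for the asset price $S_t$ and its variance $V_t$. The symbols are the conventional textbook ones and are an assumption here; the abstract does not state the notation used in the dissertation.

```latex
\begin{aligned}
dS_t &= \mu S_t \, dt + \sqrt{V_t}\, S_t \, dW_t^{(1)}, \\
dV_t &= \kappa(\theta - V_t)\, dt + \sigma \sqrt{V_t}\, dW_t^{(2)}, \\
d\langle W^{(1)}, W^{(2)} \rangle_t &= \rho \, dt .
\end{aligned}
```

Under this notation, the parameters to be estimated would be the drift $\mu$, the mean-reversion rate $\kappa$, the long-run variance $\theta$, the volatility of variance $\sigma$ and the correlation $\rho$.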
- Item. Open Access. A recommender system for e-retail (2016). Walwyn, Thomas; Varughese, Melvin. The e-retail sector in South Africa has a significant opportunity to capture a large portion of the country's retail industry. Central to seizing this opportunity is leveraging the advantages that the online setting affords. In particular, the e-retailer can offer an extremely large catalogue of products, far beyond what a traditional retailer is capable of supporting. However, as the catalogue grows, it becomes increasingly difficult for a customer to efficiently discover desirable products. As a consequence, it is important for the e-retailer to develop tools that automatically explore the catalogue for the customer. In this dissertation, we develop a recommender system (RS), whose purpose is to provide suggestions for products that are most likely of interest to a particular customer. There are two primary contributions of this dissertation. First, we describe a set of six characteristics that all effective RSs should possess, namely: accuracy, responsiveness, durability, scalability, model management, and extensibility. Second, we develop an RS that is capable of serving recommendations in an actual e-retail environment. The design of the RS is an attempt to embody the characteristics mentioned above. In addition, to show how the RS supports model selection, we present a proof-of-concept experiment comparing two popular methods for generating recommendations that we implement for this dissertation, namely, implicit matrix factorisation (IMF) and Bayesian personalised ranking (BPR).
- Item. Open Access. Sequential nonparametric estimation via Hermite series estimators (2020). Stephanou, Michael Jared; Varughese, Melvin. Algorithms for estimating the statistical properties of streams of data in real time, as well as for the efficient analysis of massive data sets, are becoming particularly pertinent given the increasing ubiquity of such data. In this thesis we introduce novel approaches to sequential (online) estimation in both stationary and non-stationary settings based on Hermite series density estimators. In the univariate context we apply Hermite series based distribution function estimators to sequential cumulative distribution function estimation. These distribution function estimators are particularly useful because they allow the sequential estimation of the full cumulative distribution function. This is in contrast to the empirical distribution function estimator and smooth kernel distribution function estimator which only allow sequential cumulative probability estimation at predefined values on the support of the associated density function. We explore the asymptotic consistency and robustness properties of the Hermite series based cumulative distribution function estimator thereby redressing a gap in the literature. Given the sequential Hermite series based distribution function estimator, we obtain sequential quantile estimates numerically. Our algorithms go beyond existing sequential quantile estimation algorithms in that they allow arbitrary quantiles (as opposed to pre-specified quantiles) to be estimated at any point in time, in both the static and dynamic quantile estimation settings. In the bivariate context we introduce a Hermite series based sequential estimator for the Spearman's rank correlation coefficient and provide algorithms applicable in both the stationary and non-stationary settings.
To treat the non-stationary setting, we introduce a novel, exponentially weighted estimator for the Spearman's rank correlation, which allows the local nonparametric correlation of a bivariate data stream to be tracked. To the best of our knowledge this is the first algorithm to be proposed for estimating a time-varying Spearman's rank correlation that does not rely on a moving window approach. We explore the practical effectiveness of the Hermite series based estimators through real data and simulation studies, demonstrating competitive performance compared to leading existing algorithms. The potential applications of this work are manifold. Our sequential distribution function and quantile estimation algorithms can be applied to real time anomaly and outlier detection, real time provisioning for future demand, and real time risk estimation, for example. The Hermite series based Spearman's rank correlation estimator can be applied to fast and robust online calculation of correlation which may vary over time. Possible machine learning applications include fast feature selection and hierarchical clustering on massive data sets, amongst others.