Browsing by Author "Er, Sebnem"
- Item (Open Access): An analysis of household water consumption in the City of Cape Town using a panel data set (2016-2020) (2022). Kaplan, Anna Leah; Er, Sebnem; Visser, Martine.
  Understanding consumer behaviour with respect to water consumption has become an active field of study. This thesis uses a household billing dataset that tracks the quantity of water consumed by households in the City of Cape Town (CoCT) from 2016 to 2020. The household billing data was filtered to include only household observations and then aggregated to the ward level. As a result, the aggregated data form a balanced spatial panel dataset comprising 20 quarterly observations for each of the 88 wards. Using the billing dataset, multiple linear regression models, panel data models and spatial panel models were implemented to predict ward-level water consumption. Using several visualisations and statistical measures, this thesis found that consumption dropped significantly during the drought period (2016-2018), and it also identified spatial clusters of water consumption in the CoCT. The data showed that before and after the drought, water consumption exhibited a seasonal pattern that was absent during the drought period. Although consumption levels increased after the drought, they did not return to pre-drought levels. The linear models implemented in this thesis achieved adjusted R-squared values of up to 0.85, implying that the independent variables used in the models explain a large amount of the variation observed in the dependent variable, the quantity of ward-level water consumption.
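A ward-level panel regression of this kind can be sketched as below; the column names (ward, quarter, kilolitres, tariff) are illustrative assumptions rather than the thesis's actual variables, and statsmodels stands in for whichever panel or spatial package the author used.

```python
# Minimal sketch of a ward-level panel regression with ward and quarter fixed
# effects, assuming hypothetical column names; not the thesis's exact model.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ward_quarterly_consumption.csv")  # hypothetical aggregated panel

# Ward fixed effects absorb time-invariant ward characteristics; quarter
# dummies capture seasonality and the drought-period drop.
model = smf.ols("kilolitres ~ tariff + C(quarter) + C(ward)", data=df).fit()
print(model.rsquared_adj)  # comparable to the adjusted R-squared reported above
```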
- Item (Open Access): Analysis of gender wage gap using mixed effects models (2025). Chikanya, Magnolia M; Er, Sebnem; Silal, Sheetal.
  Despite government interventions, the gender wage gap persists in workplaces. While reports on whether the gap is widening or narrowing vary, addressing this issue remains crucial. Traditionally, researchers have employed methods like the Blinder-Oaxaca decomposition and quantile regression to estimate the gender wage gap. However, these approaches often leave a high unexplained variance attributed to discrimination. In existing studies, gender wage gap estimates have typically been aggregated, and attempts to disaggregate the analysis have focused on broader levels such as occupations and salary bands. To delve deeper, human resource data from the National Department of Health in South Africa's Eastern Cape province were leveraged. The goal was to analyze the gender wage gap for each job title using a novel approach: linear mixed effects regression. The linear mixed effects model captures both systematic trends and unexplained variability simultaneously, providing a more comprehensive understanding of the gender wage gap. The key findings are as follows:
  1. The unexplained variance in the gender wage gap was remarkably low, accounting for only 3% of total variance, indicating that the model captures most of the variability in the data and leaves minimal unexplained variation.
  2. Job titles emerged as highly significant, explaining 83% of the total random variance. This highlights the importance of considering specific roles when analyzing the gender wage gap.
  3. Over time, interesting patterns were observed: from 2010 the gender wage gap narrowed, but starting around 2015 it gradually widened again.
  4. Encouragingly, 42% of the job title groups showed a gender wage gap in favor of women, and a substantial proportion of females occupied managerial and highly skilled positions.
  Therefore, incorporating random effects through linear mixed effects regression enriched the analysis of the gender wage gap. By examining job titles individually, detailed insights into this complex issue were gained. These findings underscore the importance of considering both fixed and random effects when studying wage disparities.
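As an illustration of the random-effects-by-job-title idea, a mixed model of this shape can be fitted with statsmodels; the variable names (log_salary, gender, year, experience, job_title) are assumed for the sketch and are not taken from the thesis.

```python
# Hedged sketch of a linear mixed effects wage model with a random intercept
# and a random gender slope per job title; column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

hr = pd.read_csv("hr_records.csv")  # hypothetical HR extract

model = smf.mixedlm(
    "log_salary ~ gender + year + experience",  # fixed effects
    data=hr,
    groups=hr["job_title"],                     # random effects grouped by job title
    re_formula="~gender",                       # job-title-specific gender gap
).fit()
print(model.summary())
```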
- Item (Open Access): Automated detection and classification of red roman in unconstrained underwater environments using Mask R-CNN (2021). Conrady, Christopher; Er, Sebnem; Attwood, Colin G.
  The availability of relatively cheap, high-resolution digital cameras has led to an exponential increase in the capture of natural environments and their inhabitants. Video-based surveys are particularly useful in the underwater domain, where observation by humans can be expensive, dangerous, inaccessible, or destructive to the natural environment. Moreover, video-based surveys offer an unedited record of biodiversity at a given point in time, one that is not reliant on human recall or susceptible to observer bias. In addition, secondary data that are useful in scientific study (date, time, location, etc.) are by default stored in almost all digital formats as metadata. When analysed effectively, this growing body of digital data offers the opportunity for robust and independently reproducible scientific study of marine biodiversity (and how it might change over time, for example). However, the manual review of image and video data by humans is slow, expensive, and not scalable: a large majority of marine data has never been analysed by human experts. This necessitates computer-based (or automated) methods of analysis that can be deployed at a fraction of the time and cost, with comparable accuracy. Mask R-CNN, a deep learning object recognition framework, has outperformed all previous state-of-the-art results on competitive benchmarking tasks. Despite this success, Mask R-CNN and other state-of-the-art object recognition techniques have not been widely applied in the underwater domain, and not at all within the context of South Africa. To address this gap in the literature, this thesis contributes (i) a novel image dataset of red roman (Chrysoblephus laticeps), a fish species endemic to Southern Africa, and (ii) a Mask R-CNN framework for the automated localisation, classification, counting, and tracking of red roman in unconstrained underwater environments. The model, trained on an 80:10:10 split, accurately detected and classified red roman on the training dataset (mAP50 = 80.29%), the validation dataset (mAP50 = 80.35%), and previously unseen footage (test dataset) (mAP50 = 81.45%). The fact that the model performs equally well on unseen footage suggests that it is capable of generalising to new streams of data not used in this research; this is critical for the utility of any statistical model outside of "laboratory conditions". This research serves as a proof of concept that machine learning based methods of video analysis of marine data can replace, or at least supplement, human analysis.
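As a rough illustration of the detection step, the snippet below runs a generic pre-trained Mask R-CNN from torchvision over a single frame; it is a stand-in sketch, not the thesis's trained red roman model, and the frame path and 0.5 confidence threshold are arbitrary assumptions.

```python
# Sketch: instance segmentation of one video frame with a generic Mask R-CNN.
# This uses torchvision's COCO-pretrained model as a stand-in for the thesis's
# red roman detector; paths, labels and threshold are illustrative only.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = to_tensor(Image.open("frame_0001.jpg"))        # hypothetical frame path
with torch.no_grad():
    out = model([frame])[0]                             # boxes, labels, scores, masks

keep = out["scores"] > 0.5                              # arbitrary confidence cut-off
print(int(keep.sum()), "detections above threshold")    # crude per-frame count
```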
- Item (Open Access): Biplots based on principal surfaces (2019). Ganey, Raeesa; Er, Sebnem; Lubbe, Sugnet.
  Principal surfaces are smooth two-dimensional surfaces that pass through the middle of a p-dimensional data set. They minimise the distance from the data points and provide a nonlinear summary of the data. The surfaces are nonparametric and their shape is suggested by the data. A surface is fitted using an iterative procedure which starts with a linear summary, typically a principal component plane. Each successive iteration is a local average of the p-dimensional points, where an average is based on the projection of a point onto the nonlinear surface of the previous iteration. Biplots are extensions of the ordinary scatterplot that provide for more than three variables. When the differences between data points are measured using a Euclidean embeddable dissimilarity function, observations and the associated variables can be displayed on a nonlinear biplot. A nonlinear biplot is predictive if information on variables is added in such a way that it allows the values of the variables to be estimated for points in the biplot. Prediction trajectories, which tend to be nonlinear, are created on the biplot to allow information about variables to be estimated. The goal is to extend nonlinear biplot methodology to principal surfaces. The ultimate emphasis is on high-dimensional data, where the nonlinear biplot based on a principal surface allows for visualisation of samples, variable trajectories and predictive sets of contour lines. The proposed biplot provides more accurate predictions, with the additional feature of visualising the extent of nonlinearity that exists in the data.
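The projection/local-averaging iteration can be caricatured in a few lines. The sketch below is a toy nearest-neighbour version under assumed parameters (k neighbours, a fixed number of iterations) and is not the smoother-based algorithm developed in the thesis.

```python
# Toy sketch of a principal-surface fit: initialise with a principal component
# plane, then alternate local averaging and reprojection. Illustrative only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def principal_surface(X, n_iter=10, k=20):
    # Step 0: initialise the 2-D parameterisation with a principal component plane.
    lam = PCA(n_components=2).fit_transform(X)
    fitted = X.copy()
    for _ in range(n_iter):
        # Averaging step: the surface at each parameter value is the mean of the
        # k observations whose parameters lie closest in the 2-D space.
        nbrs = NearestNeighbors(n_neighbors=k).fit(lam)
        _, idx = nbrs.kneighbors(lam)
        fitted = X[idx].mean(axis=1)                 # surface evaluated at each lam_i
        # Projection step: each observation takes the parameter of the fitted
        # surface point it is closest to in the original p-dimensional space.
        proj = NearestNeighbors(n_neighbors=1).fit(fitted)
        _, nearest = proj.kneighbors(X)
        lam = lam[nearest.ravel()]
    return lam, fitted

lam, surface = principal_surface(np.random.randn(200, 5))
print(lam.shape, surface.shape)
```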
- Item (Open Access): Building a question answering system for the introduction to statistics course using supervised learning techniques (2020). Leonhardt, Waldo; Er, Sebnem; Scott, Leanne.
  Question Answering (QA) is the task of automatically generating an answer to a question asked by a human in natural language. Open-domain QA is still a difficult problem to solve even after 60 years of research in this field, as trying to answer questions which cover a wide range of subjects is a complex matter. Closed-domain QA is, on the other hand, more achievable, as the context for asking questions is restricted and allows for more accurate interpretation. This dissertation explores how a QA system could be built for the Introduction to Statistics course taught online at the University of Cape Town (UCT), for the purpose of answering administrative queries. This course runs twice a year and students tend to ask similar administrative questions each time the course is run. If a QA system can successfully answer these questions automatically, it would save lecturers the time of answering them manually and enable students to receive answers immediately. For a machine to be able to interpret natural language questions, methods are needed to transform text into numbers while still preserving the meaning of the text. The field of Natural Language Processing (NLP) offers the building blocks for the methods used in this study. After predicting the category of a new question using Multinomial Logistic Regression (MLR), the past question that is most similar to the new question is retrieved and its answer is used for the new question. The following five classifiers were compared to see which one provides the best results for categorising a new question: Naive Bayes, Logistic Regression, Support Vector Machines, Stochastic Gradient Descent and Random Forests. The cosine similarity method was used to find the most similar past question. The Round-Trip Translation (RTT) technique was explored as an augmentation method for text, in an attempt to increase the dataset size. Methods were compared using the initial base dataset of 744 questions and the extended dataset of 6 614 questions generated by the RTT technique. In addition to these two datasets, features based on Bag-of-Words (BoW), Term Frequency times Inverse Document Frequency (TF-IDF), Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDiA), pre-trained Global Vector (GloVe) word embeddings and custom-engineered features were also compared. This study found that a model using an MLR classifier with TF-IDF unigram and bigram features (built on the smaller 744-question dataset) performed the best, with a test F1-measure of 84.8%. Models using a Stochastic Gradient Descent classifier also performed very well with a variety of features, indicating that Stochastic Gradient Descent is the most versatile classifier to use. No significant improvements were found using the extended RTT dataset of 6 614 questions, but this dataset was used by the model that ranked eighth. A simulator was also built to illustrate and test how a bot (an autonomous program on a network that is able to interact with users) can be used to facilitate the auto-answering of student questions.
  This simulator proved very useful and helped to identify that questions relating to the Course Information Pack had been excluded from the data that had initially been sourced, as students had been asking such questions through other platforms. Building a QA system using a small dataset proved to be very challenging. Restricting the domain of questions and focusing only on administrative queries was helpful. A great deal of data cleaning was needed, and all past answers had to be rewritten and standardised, as the raw answers were too specific and did not generalise well. The features that performed best for cosine similarity, and for extracting the most similar past question, were LSA topics built from TF-IDF unigram features. Using LSA topics as the input for cosine similarity, instead of the raw TF-IDF features, resolved the "curse of dimensionality". Issues with cosine similarity were observed in cases where it favoured short documents, which often led to the selection of the wrong past question. As an alternative, the use of more advanced language-modelling-based similarity measures is suggested for future study: either pre-trained word embeddings such as GloVe could be used as a language model, or a custom language model could be trained. A generic UCT language model could be valuable, and it would be preferable to build such a language model using the entire digital content of Vula across all faculties where students converse, ask questions or post comments. Building a QA system using this UCT language model is foreseen to offer better results, as terms like "Vula", "DP", "SciLab" and "jdlt1" would be endowed with more meaning. A sketch of the TF-IDF, LSA and cosine-similarity retrieval pipeline described here is given below.
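The following is a minimal scikit-learn sketch of the retrieval idea: TF-IDF unigrams and bigrams feed a multinomial logistic regression categoriser, and LSA topics plus cosine similarity pick the most similar past question. The two toy questions, categories and one-dimensional LSA space are assumptions for the sketch, not values from the dissertation.

```python
# Hedged sketch of the two-stage QA pipeline: classify the category with
# logistic regression on TF-IDF features, then retrieve the most similar
# past question using cosine similarity over LSA topics.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

past_questions = ["When is the test?", "Where do I upload my assignment?"]  # toy data
categories = ["assessment", "submission"]

vec = TfidfVectorizer(ngram_range=(1, 2))            # unigram + bigram TF-IDF
X = vec.fit_transform(past_questions)

clf = LogisticRegression(max_iter=1000).fit(X, categories)   # category classifier

svd = TruncatedSVD(n_components=1)                   # LSA topics; real dimension larger
topics = svd.fit_transform(X)

new_q = ["what date is the test"]
x_new = vec.transform(new_q)
print("predicted category:", clf.predict(x_new)[0])
sims = cosine_similarity(svd.transform(x_new), topics)
print("most similar past question:", past_questions[sims.argmax()])
```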
- Item (Open Access): Cape Town road traffic accident analysis: Utilising supervised learning techniques and discussing their effectiveness (2022). Du Toit, Christo; Er, Sebnem; Salau, Sulaiman.
  Road traffic accidents (RTAs) are a major cause of death and injury around the world and in South Africa. Methods to understand and reduce the frequency and injury severity of RTAs are of utmost importance. There is limited South African literature on modelling RTA injury severity using supervised learning (SL) methods, which fit a model relating a target variable to a set of predictor variables. In this thesis, multinomial logistic regression, classification trees (CT), random forests (RF), gradient boosted machines (GBM) and artificial neural networks (ANN) are used to model the potentially non-linear relationships between accident-related factors and injury severity. Data on RTAs that occurred in the city of Cape Town during the period 2015-2017 are used for this study. The data contain the injury severity of the RTAs as well as several accident-related variables. The injury-severity categories of RTAs are classified as "no injury", "slight", "serious" and "fatal". Additional locational and situational variables were added to the dataset. The exploratory analysis revealed that the vast majority of alleged causes of RTAs (as deduced by the data capturers from the accident report) are related to driver/human error; that accidents with pedestrians make up only 5.86% of all RTAs yet account for 58.56% of "fatal" accidents and 55.37% of "serious" accidents; and that the majority of "fatal" and "serious" RTAs occur on the weekend and involve only one vehicle. It was also identified that the RTA data were severely imbalanced with regard to injury severity. Imbalanced data occur when the numbers of observations belonging to the classification categories are not approximately equal, and this can negatively affect the performance of classification methods. This thesis employed three common approaches to address class imbalance, namely (i) under-sampling of the majority class, (ii) over-sampling of the minority class and (iii) the synthetic minority oversampling technique (SMOTE). The RTA data were split into training, validation and test sets, keeping the proportions of the injury-severity categories consistent. Four training datasets were analysed: the original imbalanced data, data with the minority class over-sampled, data with the majority class under-sampled and data with synthetically created observations. The performance of the SL methods trained on these four datasets was compared using accuracy, recall, precision and F1 score as evaluation metrics. All three data sampling methods improved the CT, RF and GBM models' average recall and ability to identify observations belonging to the minority class ("fatal" RTAs). With regard to maximising average recall, the SMOTE technique was the most effective data sampling method for addressing class imbalance. Further analysis was done to determine whether simple SL methods such as multinomial logistic regression are sufficient to model RTA injury severity or whether more complex SL methods such as ANNs are required. The ANN model achieved a higher average recall and correctly identified more observations belonging to the minority class ("fatal" RTAs) than the multinomial logistic regression model. Using average recall as the main evaluation metric, the ANN was selected as the "best" performing model on the validation data.
  The ANN model correctly identified a large number of "fatal" RTAs, while also producing a high number of false positives. The ANN model was very effective at correctly identifying "no injury" RTAs, as evidenced by the high recall and precision scores, but performed poorly at correctly identifying "slight" and "serious" RTAs. Finally, the variable importance of the CT, RF and GBM models trained on the SMOTE data revealed the geographical location of an RTA, the crash type and the number of vehicles involved in an accident to be significant risk factors associated with RTA injury severity. The CT and RF models both determined the alleged cause of an accident to be significant, while the RF and GBM models determined several weather-related variables to be significant risk factors associated with RTA injury severity. Future road safety policies should focus on reducing human/driver error, reducing pedestrian-related RTAs and increasing policing efforts over weekends and during poor weather conditions. Road safety policies should also take the geographical location of RTAs into account in order to identify high-risk areas for "serious" and "fatal" RTAs.
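A minimal sketch of the SMOTE-plus-classifier step is shown below using imbalanced-learn and scikit-learn; the simulated feature matrix and the choice of a default random forest are illustrative assumptions rather than the thesis's tuned models.

```python
# Sketch: oversample the minority injury-severity classes with SMOTE, fit a
# random forest, and score with macro-averaged ("average") recall.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Stand-in imbalanced data; the real features would be accident attributes.
X, y = make_classification(n_samples=2000, n_classes=4, n_informative=6,
                           weights=[0.7, 0.2, 0.07, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # balance classes
rf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print("average recall:", recall_score(y_te, rf.predict(X_te), average="macro"))
```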
- Item (Open Access): Estimating Poverty from Aerial Images Using Convolutional Neural Networks Coupled with Statistical Regression Modelling (2019). Maluleke, Vongani; Er, Sebnem; Williams, Quentin.
  Policy makers and the government rely heavily on survey data when making policy-related decisions. Survey data are labour-intensive, costly and time-consuming to collect, hence they cannot be collected frequently or extensively. The main aim of this research is to demonstrate how a Convolutional Neural Network (CNN) coupled with statistical regression modelling can be used to estimate poverty from aerial images supplemented with national household survey data. This provides a more frequent and automated method for updating data that can be used for policy making. This aerial poverty estimation approach is executed in two phases: an aerial classification and detection phase, and a poverty modelling phase. The aerial classification and detection phase uses a CNN to perform settlement typology classification of the aerial images into three broad geotype classes, namely urban, rural and farm. This is followed by object detection to detect three broad dwelling type classes in the aerial images, namely brick house, traditional house and informal settlement. A Mask Region-based Convolutional Neural Network (Mask R-CNN) with a ResNet-101 CNN backbone is used to perform this task. The second phase, the poverty modelling phase, involves using NIDS (National Income Dynamics Study) data to compute the Sen-Shorrocks-Thon (SST) poverty index. This is followed by regression models that relate the poverty measure to aggregated results from the aerial classification and detection phase. The study area for this research is KwaZulu-Natal (KZN), South Africa; however, the approach can be extended to other provinces in South Africa by retraining the models on data associated with the location in question.
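The second phase can be sketched as a simple regression of an area-level poverty index on aggregated detection outputs; the data frame columns below (sst_index, share_informal, share_traditional, geotype) are hypothetical placeholders for the aggregated quantities described above.

```python
# Sketch of the poverty modelling phase: regress an SST poverty index on
# aggregated dwelling-type shares from the detection phase (columns assumed).
import pandas as pd
import statsmodels.formula.api as smf

area = pd.read_csv("area_level_aggregates.csv")  # hypothetical merged dataset

fit = smf.ols("sst_index ~ share_informal + share_traditional + C(geotype)",
              data=area).fit()
print(fit.summary())
```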
- Item (Open Access): Hospital readmission risk (2024). Mugova, Amos; Salau, Sulaiman; Er, Sebnem.
  Hospital readmissions are a significant challenge in healthcare, as they lead to increased costs, higher risk of mortality, treatment complications, and patient distress. This minor dissertation, set within the South African healthcare framework, investigates the potential of both traditional clinical screening tools and advanced statistical learning methods for predicting hospital readmission risk. The methods considered include the LACE score, decision trees, logistic regression, random forests, gradient-boosting methods, and neural networks. The study uses data from South Africa's privately insured demographic, provided by a private insurer. It includes a comprehensive array of patient information such as demographics, prescribed medications, medical procedures undergone, and historical hospital usage. Feature selection methods were used to identify relevant variables for model training, and the effectiveness of these variables was assessed based on their ability to differentiate between patients at risk of hospital readmission within 30 days after discharge. The statistical learning methods' efficacy was measured using several performance indicators, such as prediction accuracy, F1 score, Area Under the Receiver Operating Characteristics Curve (AUC), Area Under the Precision-Recall Curve (AUC-PR), and the Matthews Correlation Coefficient (MCC). The study found that the neural network model outperformed the other statistical learning methods evaluated across various metrics. Moreover, the research extends the range of variables used to predict hospital readmissions beyond the traditional LACE score, incorporating critical factors such as the frequency and costs of previous hospital visits, expenses related to specialist services, patient age, and the primary diagnosis category.
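The evaluation metrics listed above are all available in scikit-learn; the snippet below shows how they might be computed for a binary 30-day readmission label, with toy predicted probabilities standing in for any of the fitted models.

```python
# Sketch: scoring a 30-day readmission classifier with the metrics named above.
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             average_precision_score, matthews_corrcoef)

y_true = [0, 0, 1, 0, 1, 1, 0, 1]                    # toy readmission labels
p_hat = [0.1, 0.3, 0.8, 0.2, 0.6, 0.9, 0.4, 0.7]     # toy predicted probabilities
y_hat = [int(p > 0.5) for p in p_hat]                # 0.5 cut-off assumed

print("accuracy:", accuracy_score(y_true, y_hat))
print("F1      :", f1_score(y_true, y_hat))
print("AUC     :", roc_auc_score(y_true, p_hat))
print("AUC-PR  :", average_precision_score(y_true, p_hat))
print("MCC     :", matthews_corrcoef(y_true, y_hat))
```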
- Item (Open Access): Insurance recommendation engine using a combined collaborative filtering and neural network approach (2021). Pillay, Prinavan; Er, Sebnem; Clark, Allan.
  A recommendation engine for insurance modelling was designed, implemented and tested using a neural network and collaborative filtering approach. The recommendation engine aims to suggest suitable insurance products for new or existing customers, based on their features or selection history. The collaborative filtering approach used matrix factorization on an existing user base to provide recommendation scores for new products to existing users. The content-based method used a neural network architecture which utilized user features to provide product recommendations for new users. Both methods were deployed using the TensorFlow machine learning framework. The hybrid approach helps address the cold-start problem, where users have no interaction history. The collaborative filtering approach achieved a root mean square error of 0.13 on implicit feedback ratings in the range 0-1, and an overall Top-3 classification accuracy (the ability to predict one of a customer's top 3 choices) of 83.8%. The neural network system achieved a Top-3 classification accuracy of 77.2%. The system thus achieved good training performance and, given further modifications, could be used in a production environment.
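A matrix-factorization recommender of this kind can be written compactly in Keras; the embedding size of 16, the user and product counts and the simulated implicit 0-1 ratings below are assumptions for the sketch, not the dissertation's configuration.

```python
# Sketch: matrix factorization as dot products of user and product embeddings,
# trained on implicit 0-1 feedback (dimensions and data are illustrative).
import numpy as np
import tensorflow as tf

n_users, n_products, dim = 1000, 50, 16
user_in = tf.keras.Input(shape=(1,))
prod_in = tf.keras.Input(shape=(1,))
u = tf.keras.layers.Embedding(n_users, dim)(user_in)
p = tf.keras.layers.Embedding(n_products, dim)(prod_in)
score = tf.keras.layers.Dot(axes=-1)([u, p])              # predicted affinity
model = tf.keras.Model([user_in, prod_in], tf.keras.layers.Flatten()(score))
model.compile(optimizer="adam", loss="mse",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])

# Toy interaction triples (user id, product id, implicit rating in {0, 1}).
users = np.random.randint(0, n_users, 5000)
prods = np.random.randint(0, n_products, 5000)
ratings = np.random.randint(0, 2, 5000).astype("float32")
model.fit([users, prods], ratings, epochs=2, batch_size=64, verbose=0)
```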
- Item (Open Access): Natural Language Financial Forecasting: The South African Context (2021). Katende, Simon; Er, Sebnem; Nyirenda, Juwa; Rajaratnam, Kanshukan.
  The stock market plays a fundamental role in any country's economy, as it efficiently directs the flow of savings and investments in ways that advance the accumulation of capital and the production of goods and services. Factors that affect the price movement of stocks include company news and performance, macroeconomic factors, market sentiment, as well as unforeseeable events. The conventional prediction approach is based on historical numerical data such as price trends and trading volumes, to name a few. This thesis reviews the Natural Language Financial Forecasting (NLFF) literature and proposes novel implementation techniques that use Stock Exchange News Service (SENS) announcements to predict stock price trends with machine learning methods. Deep learning has recently sparked interest in the data science communities, but the literature on the application of deep learning to stock prediction, especially in emerging markets like South Africa, is still limited. In this thesis, announcements were labelled using a more statistically rigorous technique known as the event study. Classical textual preprocessing and representation techniques were replaced with state-of-the-art sentence embeddings. Deep learning models (a Deep Neural Network (DNN)) were then compared to classical models (Logistic Regression (LR)). These models were trained, optimised and deployed using the TensorFlow Machine Learning (ML) framework on Google Cloud AI Platform. The comparison of the models' performance results shows that both the DNN and LR have the potential operational capability to use information dissemination as a means to assist market participants with their trading decisions.
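Assuming announcement texts have already been converted to fixed-length sentence embeddings, the DNN-versus-LR comparison can be sketched as below; the 512-dimensional embeddings, binary up/down labels and small dense architecture are placeholders, not the thesis's actual configuration.

```python
# Sketch: compare a small dense network and logistic regression on precomputed
# sentence embeddings of SENS announcements (embeddings simulated here).
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LogisticRegression

X = np.random.randn(1000, 512).astype("float32")   # stand-in sentence embeddings
y = np.random.randint(0, 2, 1000)                   # up/down price-trend labels

# Classical baseline.
lr = LogisticRegression(max_iter=1000).fit(X, y)

# Simple DNN classifier.
dnn = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(512,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
dnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
dnn.fit(X, y, epochs=3, batch_size=32, verbose=0)
```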
- Item (Open Access): Object Detection and Size Determination of Pineapple Fruit at a Juicing Factory (2021). Harris, Jessica; Er, Sebnem.
  The aim of this thesis is to develop a method for determining pineapple fruit size from images. This was achieved by first detecting pineapples in each image using a Mask Region-based Convolutional Neural Network (Mask R-CNN) and then extracting the pixel diameter and length measurements, and the projected areas, from the detected mask outputs. Various Mask R-CNNs were considered for the task of pineapple detection. The best-performing detector made use of MS COCO starting weights, a ResNet50 CNN backbone, and horizontal-flipping data augmentation during training. This model (Model 4: COCO Fliplr Res50) achieved an average precision of 91.4% on the validation set and 90.1% on the test set, and was used to predict masks for an unseen dataset containing images of pre-measured pineapples. The distributions of measurements extracted from the detected masks were compared to those of the manual measurements using two-sample Z-tests and Kolmogorov-Smirnov (KS) tests. There was sufficient similarity between the distributions, and it was therefore established that the reported method is appropriate for pineapple size determination in this context. All the data and code are available in a GitHub repository for reproducible research.
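The measurement-extraction step can be illustrated with numpy and scipy: given a boolean instance mask, its pixel length, diameter and projected area follow from the mask's extent and pixel count, and a KS test compares the extracted and manual distributions. The mask and measurement values below are toy data, not the thesis's detector output.

```python
# Sketch: pixel length, diameter and projected area from a boolean mask,
# plus a KS test against manual measurements (all values are toy data).
import numpy as np
from scipy.stats import ks_2samp

mask = np.zeros((100, 100), dtype=bool)
mask[20:80, 40:60] = True                      # toy "pineapple" mask

rows, cols = np.nonzero(mask)
length_px = rows.max() - rows.min() + 1        # longest vertical extent in pixels
diameter_px = cols.max() - cols.min() + 1      # widest horizontal extent in pixels
area_px = mask.sum()                           # projected area in pixels
print(length_px, diameter_px, area_px)

detected = np.random.normal(15.0, 1.0, 50)     # toy extracted diameters (cm)
manual = np.random.normal(15.1, 1.1, 50)       # toy manual diameters (cm)
print(ks_2samp(detected, manual))              # two-sample KS test
```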
- Item (Open Access): soMLier: A South African Wine Recommender System (2022). Redelinghuys, Joshua; Er, Sebnem.
  Though several commercial wine recommender systems exist, they are largely tailored to consumers outside of South Africa (SA). Consequently, these systems are of limited use to novice wine consumers in SA. To address this, the aim of this research is to develop a system for South African consumers that yields high-quality wine recommendations, maximises the accuracy of predicted ratings for those recommendations and provides insights into why those suggestions were made. To achieve this, a hybrid system "soMLier" (pronounced "sommelier") is built in this thesis that makes use of two datasets. Firstly, a database containing several attributes of South African wines, such as chemical composition, style, aroma, price and description, was supplied by wine.co.za (an SA wine retailer). Secondly, for each wine in that database, the numeric 5-star ratings and textual reviews made by users worldwide were scraped from Vivino.com to serve as a dataset of user preferences. Together, these are used to develop and compare several systems, the most optimal of which are combined in the final system. Item-based collaborative filtering methods are investigated first, along with model-based techniques (such as matrix factorisation and neural networks), when applied to the user rating dataset to generate wine recommendations through the ranking of rating predictions. Respectively, these methods are determined to excel at generating lists of relevant wine recommendations and at producing accurate corresponding predicted ratings. Next, the wine attribute data is used to explore the efficacy of content-based systems. Numeric features (such as price) are compared along with categorical features (such as style) using various distance measures, and the relationships between the textual descriptions of the wines are determined using natural language processing methods. These methods are found to be most appropriate for explaining wine recommendations. Hence, the final hybrid system makes use of collaborative filtering to generate recommendations, matrix factorisation to predict user ratings, and content-based techniques to rationalise the wine suggestions made. This thesis contributes the "soMLier" system, which is of specific use to SA wine consumers as it bridges the gap between the technologies used by highly developed existing systems and the SA wine market. Though this final system would benefit from more explicit user data to establish a richer model of user preferences, it can ultimately assist consumers in exploring unfamiliar wines, discovering wines they will likely enjoy, and understanding their preferences for SA wine.
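Item-based collaborative filtering of the kind described can be sketched with a small user-by-wine rating matrix and cosine similarity between wine columns; the three-wine matrix below is invented purely for illustration.

```python
# Sketch: item-based collaborative filtering on a toy user-by-wine rating
# matrix; unseen wines are scored by similarity-weighted averages of the
# user's existing ratings (data invented for illustration).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([[5, 4, 0],      # rows: users, columns: wines, 0 = unrated
                    [4, 0, 2],
                    [1, 2, 5]], dtype=float)

item_sim = cosine_similarity(ratings.T)         # wine-to-wine similarity

user = ratings[0]
rated = user > 0
scores = item_sim[:, rated] @ user[rated] / (item_sim[:, rated].sum(axis=1) + 1e-9)
scores[rated] = -np.inf                          # do not re-recommend rated wines
print("recommend wine index:", int(scores.argmax()))
```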
- Item (Open Access): Triplet entropy loss: improving the generalisation of short speech language identification systems (2021). Van Der Merwe, Ruan Henry; Er, Sebnem.
  Spoken language identification systems form an integral part of many speech recognition tools today. Over the years many techniques have been used to identify the language spoken, given just the audio input, but in recent years the trend has been to use end-to-end deep learning systems. Most of these techniques involve converting the audio signal into a spectrogram which is fed into a Convolutional Neural Network (CNN) that then predicts the spoken language. This technique performs very well when the data being fed to the model originate from the same domain as the training examples, but as soon as the input comes from a different domain these systems tend to perform poorly, for example when a system trained on WhatsApp recordings is put into production in an environment where it receives recordings from a phone line. The research presented investigates several methods to improve the generalisation of language identification (LID) systems to new speakers and to new domains. These methods involve spectral augmentation, where spectrograms are masked in the frequency or time bands during training, and CNN architectures that are pre-trained on the ImageNet dataset. The research also introduces the novel Triplet Entropy Loss training method, which involves training a network simultaneously using cross entropy and triplet loss. Several tests were run with three different CNN architectures to investigate the effect all three of these methods have on the generalisation of an LID system. The tests were done in a South African context on six languages, namely Afrikaans, English, Sepedi, Setswana, Xhosa and Zulu. The two domains tested were data from the NCHLT speech corpus, used as the training domain, with the Lwazi speech corpus being the unseen domain. It was found that all three methods improved the generalisation of the models, though not significantly. Even though the models trained using Triplet Entropy Loss showed a better understanding of the languages and higher accuracies, it appears as though the models still memorise word patterns present in the spectrograms rather than learning the finer nuances of a language. The research shows that Triplet Entropy Loss has great potential and should be investigated further, not only in language identification tasks but in any classification task.
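The thesis describes Triplet Entropy Loss as simultaneous training with cross entropy and triplet loss; one way to combine the two, using batch-hard mining over the embedding space, is sketched below. The batch-hard strategy, margin and weighting are assumptions for illustration, not necessarily the thesis's exact formulation.

```python
# Sketch: a combined cross-entropy + batch-hard triplet loss ("triplet entropy"
# style); mining strategy, margin and weighting are illustrative assumptions.
import tensorflow as tf

def triplet_entropy_loss(labels, embeddings, logits, margin=1.0, alpha=1.0):
    # Cross entropy on the classification head.
    ce = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True))

    # Pairwise squared Euclidean distances between embeddings in the batch.
    dot = tf.matmul(embeddings, embeddings, transpose_b=True)
    sq = tf.linalg.diag_part(dot)
    dist = tf.maximum(sq[:, None] - 2.0 * dot + sq[None, :], 0.0)

    # Batch-hard mining: furthest same-class and closest other-class example.
    same = tf.cast(tf.equal(labels[:, None], labels[None, :]), tf.float32)
    hardest_pos = tf.reduce_max(dist * same, axis=1)
    hardest_neg = tf.reduce_min(dist + same * tf.reduce_max(dist), axis=1)
    triplet = tf.reduce_mean(tf.maximum(hardest_pos - hardest_neg + margin, 0.0))

    return ce + alpha * triplet
```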
- Item (Open Access): Unsupervised Machine Learning Application for the Identification of Kimberlite Ore Facie using Convolutional Neural Networks and Deep Embedded Clustering (2021). Langton, Sean; Er, Sebnem.
  Mining is a key economic contributor to many regions globally, especially in developing nations. The design and operation of the processing plants associated with each of these mines are highly dependent on the composition of the feed material. The aim of this research is to demonstrate the viability of implementing a computer vision solution to provide online information on the composition of material entering the plant, thus allowing plant operators to adjust equipment settings and process parameters accordingly. Data are collected in the form of high-resolution images, captured every couple of seconds, of material on the main feed conveyor belt into the Kao Diamond Mine processing plant. The modelling phase of the research is implemented in two stages. The first stage involves the implementation of a Mask Region-based Convolutional Neural Network (Mask R-CNN) model with a ResNet-101 CNN backbone for instance segmentation of individual rocks from each image. These individual rock images are extracted and used for the second phase of the modelling pipeline, which utilises an unsupervised clustering method known as Convolutional Deep Embedded Clustering with Data Augmentation (ConvDEC-DA). The clustering phase of this research provides a method to group feed material rocks into their respective types or facies using features developed from the auto-encoder portion of the ConvDEC-DA modelling. While this research focuses on the clustering of kimberlite rocks according to their respective facies, similar implementations are possible for a wide range of mining and rock types.
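A heavily simplified stand-in for the clustering stage is shown below: a small convolutional autoencoder learns an embedding of rock crops and k-means groups the embeddings. Full ConvDEC-DA additionally refines the cluster assignments with a KL-divergence clustering loss and data augmentation, which this sketch omits; image size, latent dimension and cluster count are assumptions.

```python
# Simplified stand-in for ConvDEC-DA: train a small convolutional autoencoder
# on rock crops, then cluster the latent codes with k-means. The real method
# also refines assignments with a clustering (KL-divergence) loss.
import numpy as np
import tensorflow as tf
from sklearn.cluster import KMeans

rocks = np.random.rand(256, 64, 64, 3).astype("float32")  # stand-in rock crops

inp = tf.keras.Input(shape=(64, 64, 3))
x = tf.keras.layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inp)
x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)
z = tf.keras.layers.Dense(10)(tf.keras.layers.Flatten()(x))        # latent embedding
x = tf.keras.layers.Dense(16 * 16 * 32, activation="relu")(z)
x = tf.keras.layers.Reshape((16, 16, 32))(x)
x = tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
out = tf.keras.layers.Conv2DTranspose(3, 3, strides=2, padding="same", activation="sigmoid")(x)

autoencoder = tf.keras.Model(inp, out)
encoder = tf.keras.Model(inp, z)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(rocks, rocks, epochs=2, batch_size=32, verbose=0)

facies = KMeans(n_clusters=3, n_init=10).fit_predict(encoder.predict(rocks, verbose=0))
print(np.bincount(facies))                                          # rocks per cluster
```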
- Item (Open Access): Word Sense Disambiguation in the domain of Sentiment Analysis through Deep Learning (2022). Baiju, Vedanth; Er, Sebnem; Dufourq, Emmanuel.
  Sentiment analysis is a major component of Natural Language Processing (NLP). Even though continuous improvements in NLP are being made, word sense disambiguation remains a complex problem within the domain of sentiment analysis (Navigli, 2009). Word Sense Disambiguation (WSD) is the problem of identifying the correct sense of ambiguous words in a sentence, as various words can have multiple meanings depending on the context in which they are used. Although advances in deep learning continue within the NLP domain, WSD is still a task in which deep learning is yet to be fully explored. Whilst research on WSD as a whole does exist, there is limited research on WSD within the domain of sentiment analysis (Seifollahi and Shajari, 2019). The proposed research explores the task of WSD in the domain of sentiment analysis through recent advances in deep neural networks, with a specific focus on 1D Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) algorithms. Sentiments expressed in text sourced from the Amazon product reviews data were analysed using 1D CNN and LSTM deep learning algorithms. The Amazon product reviews data are segmented according to the type of product category, which is essentially a context category. The effectiveness of each algorithm was evaluated from a statistical performance and efficiency perspective. It was found that the inclusion of context as a model input improves out-of-sample performance compared to a model without context as an input. In addition, it was observed that including more context categories as an input improved the out-of-sample performance for both the 1D CNN and LSTM algorithms. Furthermore, the 1D CNN exhibited superior performance over the LSTM model from a statistical and efficiency standpoint. Given that there has not been a considerable amount of research exploring the application of deep learning to the problem of WSD within sentiment analysis, the findings of this research provide a base level of knowledge for future exploration and applications of WSD relating to sentiment analysis.
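The "context as an additional model input" idea can be sketched as a two-input Keras model: review tokens go through an embedding and a 1D convolution, the product-category id goes through its own small embedding, and the two are concatenated before the sentiment output. Vocabulary size, sequence length and layer sizes are placeholders, not the thesis's configuration.

```python
# Sketch: sentiment classifier with review text plus a product-category
# (context) input, roughly mirroring the "context as model input" idea.
import numpy as np
import tensorflow as tf

vocab, seq_len, n_categories = 5000, 100, 10      # placeholder sizes

text_in = tf.keras.Input(shape=(seq_len,), name="tokens")
ctx_in = tf.keras.Input(shape=(1,), name="category")

t = tf.keras.layers.Embedding(vocab, 64)(text_in)
t = tf.keras.layers.Conv1D(64, 5, activation="relu")(t)    # 1D CNN over tokens
t = tf.keras.layers.GlobalMaxPooling1D()(t)

c = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(n_categories, 8)(ctx_in))

h = tf.keras.layers.Concatenate()([t, c])
out = tf.keras.layers.Dense(1, activation="sigmoid")(h)    # positive/negative

model = tf.keras.Model([text_in, ctx_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy data in place of tokenised Amazon reviews and their category ids.
X_tok = np.random.randint(0, vocab, (512, seq_len))
X_cat = np.random.randint(0, n_categories, (512, 1))
y = np.random.randint(0, 2, 512)
model.fit([X_tok, X_cat], y, epochs=1, batch_size=32, verbose=0)
```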