Browsing by Author "Salau, Sulaiman"
Now showing 1 - 3 of 3
- Item, Open Access. Cape Town Airbnb price prediction: an exploration of spatial statistic and machine learning methods (2023). Williams, Courtney; Salau, Sulaiman; Er, Sebnem. This thesis predicts the prices of Airbnb listings in Cape Town, South Africa, and in doing so investigates the price determinants in the market. Using data from InsideAirbnb, traditional, spatial and machine learning models are compared. The Cape Town Airbnb market exhibits significant spatial correlation and heterogeneity. Traditional models such as OLS regression do not account for this spatial dependence, whereas spatial models do. Accounting for spatial effects improves predictive performance, but not enough to outperform the predictions of non-spatial, non-linear machine learning models. While Airbnb is a new and unique platform, its most important price determinants, such as property type, location and amenities, are consistent with those of traditional housing and accommodation markets.
- Item, Open Access. Cape Town road traffic accident analysis: Utilising supervised learning techniques and discussing their effectiveness (2022). Du Toit, Christo; Er, Sebnem; Salau, Sulaiman. Road traffic accidents (RTAs) are a major cause of death and injury around the world and in South Africa. Methods to understand and reduce the frequency and injury severity of RTAs are of utmost importance. There is limited South African literature on modelling RTA injury-severity using supervised learning (SL) methods, which fit a model that relates a target variable to a set of predictor variables. In this thesis, multinomial logistic regression, classification trees (CT), random forests (RF), gradient boosted machines (GBM) and artificial neural networks (ANN) are used to model the potentially non-linear relationships between accident-related factors and injury-severity. Data on RTAs that occurred in the city of Cape Town during the period 2015-2017 are used for this study. The data contain the injury-severity of the RTAs as well as several accident-related variables. The injury-severity of an RTA is classified as “no injury”, “slight”, “serious” or “fatal”. Additional locational and situational variables were added to the dataset. The exploratory analysis revealed that the vast majority of alleged causes of RTAs (as deduced by the data capturers from the accident report) are related to driver/human error; that accidents with pedestrians make up only 5.86% of all RTAs yet account for 58.56% of “fatal” accidents and 55.37% of “serious” accidents; and that the majority of “fatal” and “serious” RTAs occur on the weekend and involve only one vehicle. It was also identified that the RTA data were severely imbalanced with regard to injury-severity. Imbalanced data occur when the numbers of observations belonging to each of the classification categories are not approximately equal, and this can negatively affect the performance of classification methods.
This thesis employed three common approaches to address class imbalance, namely (i) undersampling of the majority class, (ii) oversampling of the minority class and (iii) the synthetic minority oversampling technique (SMOTE). The RTA data were split into training, validation and test sets, keeping the proportions of the injury-severity categories consistent. Four training datasets were analysed: the original imbalanced data, data with the minority class oversampled, data with the majority class undersampled and data with synthetically created observations. The performance of the SL methods trained on these four datasets was compared using accuracy, recall, precision and F1 score as evaluation metrics. All three data sampling methods improved the CT, RF and GBM models' average recall and ability to identify observations belonging to the minority class (“fatal” RTAs). With regard to maximising average recall, SMOTE was the most effective data sampling method for addressing class imbalance. Further analysis was done to determine whether simple SL methods such as multinomial logistic regression are sufficient to model RTA injury-severity or whether more complex SL methods such as ANNs are required. The ANN model achieved a higher average recall and correctly identified more observations belonging to the minority class, “fatal” RTAs, than the multinomial logistic regression model. Using average recall as the main evaluation metric, the ANN was selected as the “best” performing model on the validation data. The ANN model correctly identified a large number of “fatal” RTAs but also produced a high number of false positives. It was very effective at correctly identifying “no injury” RTAs, as evidenced by the high recall and precision scores, but performed poorly at correctly identifying “slight” and “serious” RTAs.
Finally, the variable importance of the CT, RF and GBM models trained on the SMOTE data revealed the geographical location of an RTA, crash type as well as the number of vehicles involved in an accident to be significant risk factors associated with RTA injury-severity. The CT and RF models both determined the alleged cause of an accident to be significant, while the RF and GBM models determined several weather-related variables to be significant risk factors associated with RTA injury-severity. Future road safety policies should focus on reducing human/driver error, reducing pedestrian-related RTAs and increasing policing efforts over weekends and during poor weather conditions. Road safety policies should take the geographical location of RTAs into account in order to identify high-risk areas for “serious” and “fatal” RTAs.
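As an illustration of the class-imbalance discussion in the abstract above, the following is a minimal sketch of random oversampling and of an averaged recall metric, using only the Python standard library. It is not the thesis code: the function names, the toy labels and the reading of "average recall" as the unweighted (macro) mean of per-class recall are all assumptions made for this example. SMOTE itself, which synthesises new minority observations rather than duplicating existing ones, is not shown.

```python
import random
from collections import Counter, defaultdict


def random_oversample(rows, labels, seed=0):
    """Duplicate rows of smaller classes at random until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall (assumed meaning of 'average recall')."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Share of observations classified correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


if __name__ == "__main__":
    # Made-up toy labels: 8 majority-class and 2 minority-class observations.
    labels = ["no injury"] * 8 + ["fatal"] * 2
    rows = list(range(10))
    _, balanced_labels = random_oversample(rows, labels)
    print(Counter(balanced_labels))

    # A classifier that always predicts the majority class looks accurate
    # but has poor macro recall, because it never finds a "fatal" case.
    always_majority = ["no injury"] * len(labels)
    print(accuracy(labels, always_majority), macro_recall(labels, always_majority))
```

In practice, resampling of this kind would be applied to the training set only, leaving the validation and test sets at their original class proportions, as the abstract describes.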
- Item, Open Access. Hospital readmission risk (2024). Mugova, Amos; Salau, Sulaiman; Er, Sebnem. Hospital readmissions are a significant challenge in healthcare, as they lead to increased costs, higher risk of mortality, treatment complications, and patient distress. This minor dissertation, set within the South African healthcare framework, investigates the potential of both traditional clinical screening tools and advanced statistical learning methods for predicting hospital readmission risk. The methods considered include the LACE score, decision trees, logistic regression, random forests, gradient-boosting methods, and neural networks. The study uses data from South Africa's privately insured demographic, provided by a private insurer. It includes a comprehensive array of patient information such as demographics, prescribed medications, medical procedures undergone, and historical hospital usage. Feature selection methods were used to identify relevant variables for model training, and the effectiveness of these variables was assessed based on their ability to differentiate between patients at risk of hospital readmission within 30 days after discharge. The statistical learning methods' efficacy was measured using several performance indicators, such as prediction accuracy, F1 score, Area Under the Receiver Operating Characteristic Curve (AUC), Area Under the Precision-Recall Curve (AUC-PR), and the Matthews Correlation Coefficient (MCC). The study found that the neural network model outperformed the other statistical learning methods evaluated across various metrics. Moreover, the research extends the range of variables used to predict hospital readmissions beyond the traditional LACE score, incorporating critical factors such as the frequency and costs of previous hospital visits, expenses related to specialist services, patient age, and the primary diagnosis category.
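Two of the performance indicators named in the abstract above, the F1 score and the Matthews Correlation Coefficient, can be computed directly from the counts of a binary confusion matrix. The sketch below uses the standard textbook formulas. The function names, the convention of returning 0.0 when a denominator is zero, and the example counts are assumptions made for illustration and are not taken from the dissertation.

```python
import math


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient:
    (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when the denominator is zero (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Made-up counts: 10 readmitted patients out of 100, and a model that
    # predicts "not readmitted" for everyone.
    tp, tn, fp, fn = 0, 90, 0, 10
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    print(accuracy, f1_score(tp, fp, fn), mcc(tp, tn, fp, fn))
```

With these made-up counts the accuracy is high even though no readmission is detected, while F1 and MCC are both zero, which is one reason such metrics are reported alongside accuracy for imbalanced outcomes.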