Statistical model selection techniques for the Cox proportional hazards model: a comparative study

Njati, Jolando

Master Thesis

2022

Abstract

Advances in data-acquisition technology continue to produce survival data sets with many covariates. This poses a new challenge for researchers: identifying the covariates that matter for inference and prediction when the response is a time-to-event variable. In this dissertation, common model selection techniques for the Cox proportional hazards (PH) model and a random survival forest technique were compared using five performance measures: the concordance index, the integrated area under the curve, [...], [...], and R². The comparison was carried out on a multicentre clinical trial data set and in a simulation study. To develop each Cox PH model, a training data set containing 75% of the observations was used, and the model selection techniques were applied to it to select covariates. Full Cox PH models containing all covariates were also fitted for both the clinical trial data set and the simulations. On the clinical trial data set, the full model and the forward selection technique performed better on the performance metrics employed, though they do not reduce model complexity as much as the Lasso technique does. The simulation studies also showed that the full model performed better than the other techniques, with the Lasso technique overpenalising the model in the simulation with the smaller data set and many covariates. AIC and BIC were less computationally efficient than the other variable selection techniques, but in the simulations they reduced model complexity more effectively than their counterparts. The integrated area under the curve was the performance metric chosen for selecting the final model for analysis on the real data set. Unlike the other metrics, it gave more efficient outcomes across all selection techniques.
This dissertation thus showed that variable selection techniques perform differently depending on the study design and the performance measure used. Hence, to obtain a good model, it is important not to rely on a single model selection technique in isolation. Further research is therefore needed to identify and publish techniques that work well across different study designs, which would shorten the process for most researchers.
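
As an illustration of one of the performance measures named in the abstract, the concordance index can be sketched in a few lines of Python. This is a minimal sketch of Harrell's estimator, which is an assumption here: the abstract does not say which variant was used, and the function name `concordance_index` and the toy data are hypothetical. Pairs with tied observed times are simply skipped to keep the example short.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs that the risk scores order correctly.

    times       -- observed follow-up times
    events      -- 1 if the event was observed, 0 if the subject was censored
    risk_scores -- predicted risk; a higher score should mean an earlier event
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Simplification: pairs with tied observed times are ignored.
        if times[i] == times[j]:
            continue
        # Order the pair so that subject `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # The pair is comparable only if the earlier time is an observed event.
        if not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): the third subject is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [4, 3, 2, 1]))  # 1.0
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to uninformative predictions, and 0.0 to a fully reversed ranking.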