Hospital readmission prediction with long clinical notes

dc.contributor.advisor: Buys, Jan
dc.contributor.author: Nurmahomed, Yassin
dc.date.accessioned: 2023-04-13T10:49:24Z
dc.date.available: 2023-04-13T10:49:24Z
dc.date.issued: 2022
dc.date.updated: 2023-04-12T11:29:48Z
dc.description.abstract: Electronic health record (EHR) data are captured across many healthcare institutions, resulting in large amounts of diverse information that can be analysed for the diagnosis, prognosis, treatment and prevention of disease. One type of data captured by EHRs is clinical notes: unstructured text written in natural language. Natural Language Processing (NLP) can be used to build machine learning (ML) models that extract meaning from clinical notes and enable the prediction of clinical outcomes. ClinicalBERT is a Transformer-based model pre-trained on clinical text that can predict 30-day hospital readmission from clinical notes. Although it performs well, it is limited by the maximum length of the text sequence it accepts as input. Models that use longer sequences have been shown to perform better on a range of ML tasks, including on clinical text. In this work, Longformer, an ML model that is pre-trained and then fine-tuned on clinical text and can learn from longer sequences than previous models, is evaluated. Performance is compared against Deep Averaging Network (DAN) and Long Short-Term Memory (LSTM) baselines and previous state-of-the-art models in terms of the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC) and recall at a precision of 70% (RP70). Longformer beats ClinicalBERT on two of the metrics, but does not surpass one of the baselines on any of them. Training the model on early notes did not yield a substantial difference compared to training on discharge summaries. Our analysis shows that the model suffers from out-of-vocabulary words, as many biomedical concepts are missing from the original pre-training corpus.
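As a concrete illustration of the three evaluation metrics named in the abstract (AUROC, AUPRC and RP70), they can be computed with scikit-learn roughly as follows. This is a sketch on hypothetical toy labels and scores, not the thesis's own code or data; `readmission_metrics` is an assumed helper name.

```python
# Sketch: computing AUROC, AUPRC, and recall at 70% precision (RP70)
# with scikit-learn on toy predictions (hypothetical, not thesis data).
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
)

def readmission_metrics(y_true, y_score, precision_target=0.70):
    """Return (AUROC, AUPRC, recall at the given precision target)."""
    auroc = roc_auc_score(y_true, y_score)
    auprc = average_precision_score(y_true, y_score)
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # Highest recall achievable while precision stays at or above the target.
    feasible = recall[precision >= precision_target]
    rp = feasible.max() if feasible.size else 0.0
    return auroc, auprc, rp

# Toy example: 1 = readmitted within 30 days, scores are model probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9])
auroc, auprc, rp70 = readmission_metrics(y_true, y_score)
```

RP70 is read off the precision-recall curve: among all operating points whose precision is at least 0.70, it is the largest recall, so it rewards models that can flag many readmissions while keeping false alarms bounded.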
dc.identifier.apacitation: Nurmahomed, Y. (2022). <i>Hospital readmission prediction with long clinical notes</i>. Faculty of Science, Department of Computer Science. Retrieved from http://hdl.handle.net/11427/37712
dc.identifier.chicagocitation: Nurmahomed, Yassin. <i>"Hospital readmission prediction with long clinical notes."</i> Faculty of Science, Department of Computer Science, 2022. http://hdl.handle.net/11427/37712
dc.identifier.citation: Nurmahomed, Y. 2022. Hospital readmission prediction with long clinical notes. Faculty of Science, Department of Computer Science. http://hdl.handle.net/11427/37712
dc.identifier.ris:
TY  - Master Thesis
AU  - Nurmahomed, Yassin
DA  - 2022
DB  - OpenUCT
DP  - University of Cape Town
KW  - Computer Science
LK  - https://open.uct.ac.za
PY  - 2022
T1  - Hospital readmission prediction with long clinical notes
TI  - Hospital readmission prediction with long clinical notes
UR  - http://hdl.handle.net/11427/37712
ER  -
dc.identifier.uri: http://hdl.handle.net/11427/37712
dc.identifier.vancouvercitation: Nurmahomed Y. Hospital readmission prediction with long clinical notes. Faculty of Science, Department of Computer Science; 2022 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/37712
dc.language.rfc3066: eng
dc.publisher.department: Department of Computer Science
dc.publisher.faculty: Faculty of Science
dc.subject: Computer Science
dc.title: Hospital readmission prediction with long clinical notes
dc.type: Master Thesis
dc.type.qualificationlevel: Masters
dc.type.qualificationlevel: MSc
Files
Original bundle
Name: thesis_sci_2022_nurmahomed yassin.pdf
Size: 6.27 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 0 B
Format: Item-specific license agreed upon to submission
Collections