Performance analysis of text classification algorithms for PubMed articles

Savvi, Suzana

dc.contributor.advisor	Bonenkamp, Koen
dc.contributor.advisor	Little, Francesca
dc.contributor.author	Savvi, Suzana
dc.date.accessioned	2022-03-14T05:21:47Z
dc.date.available	2022-03-14T05:21:47Z
dc.date.issued	2021
dc.date.updated	2022-03-14T05:18:11Z
dc.description.abstract	The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary developed by the US National Library of Medicine (NLM) for indexing articles in the PubMed Central (PMC) archive. The annotation process is a complex and time-consuming task that relies on subjective manual assignment of MeSH concepts. Automating such tasks with machine learning may offer a more efficient and less ambiguous way of organising biomedical literature. This research presents a case study comparing the performance of several machine learning algorithms (Topic Modelling, Random Forest, Logistic Regression, Support Vector Classifiers, Multinomial Naive Bayes, Convolutional Neural Network and Long Short-Term Memory (LSTM)) in reproducing manually assigned MeSH annotations. Records for this study were retrieved from PubMed using the E-utilities API to the Entrez system of databases at the National Center for Biotechnology Information (NCBI). The MeSH vocabulary is organised in a hierarchical structure, and article abstracts labelled with a single MeSH term from the top second two layers were selected for training the machine learning models. Various strategies for multiclass text classification were considered. One used a chi-square test for feature selection to identify words relevant to each MeSH label. A second used Named Entity Recognition (NER) to extract entities from the unstructured text, and a third relied on word embeddings able to capture latent knowledge from the literature. At the start of the study, text was tokenised and vectorised using the Term Frequency-Inverse Document Frequency (Tf-idf) technique, and topic modelling was performed to ascertain the correlation between assigned topics (an unsupervised learning task) and MeSH terms in PubMed. The findings revealed that the degree of coupling was low, although significant.
Of all the classifier models trained, logistic regression on Tf-idf vectorised entities achieved the highest accuracy. Performance varied across the different MeSH categories. In conclusion, automated curation of articles by abstract may be possible for those target classes that are classified reliably and reproducibly.
dc.identifier.apacitation	Savvi, S. (2021). Performance analysis of text classification algorithms for PubMed articles. Faculty of Science, Department of Statistical Sciences. Retrieved from http://hdl.handle.net/11427/36059	en_ZA
dc.identifier.chicagocitation	Savvi, Suzana. "Performance analysis of text classification algorithms for PubMed articles." Faculty of Science, Department of Statistical Sciences, 2021. http://hdl.handle.net/11427/36059	en_ZA
dc.identifier.citation	Savvi, S. 2021. Performance analysis of text classification algorithms for PubMed articles. Faculty of Science, Department of Statistical Sciences. http://hdl.handle.net/11427/36059	en_ZA
dc.identifier.ris	TY - Master Thesis AU - Savvi, Suzana DA - 2021 DB - OpenUCT DP - University of Cape Town KW - Statistical Sciences LK - https://open.uct.ac.za PY - 2021 T1 - Performance analysis of text classification algorithms for PubMed articles TI - Performance analysis of text classification algorithms for PubMed articles UR - http://hdl.handle.net/11427/36059 ER -	en_ZA
dc.identifier.uri	http://hdl.handle.net/11427/36059
dc.identifier.vancouvercitation	Savvi S. Performance analysis of text classification algorithms for PubMed articles. Faculty of Science, Department of Statistical Sciences, 2021 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/36059	en_ZA
dc.language.rfc3066	eng
dc.publisher.department	Department of Statistical Sciences
dc.publisher.faculty	Faculty of Science
dc.subject	Statistical Sciences
dc.title	Performance analysis of text classification algorithms for PubMed articles
dc.type	Master Thesis
dc.type.qualificationlevel	Masters
dc.type.qualificationlevel	MSc

Files

Original bundle

Name: thesis_sci_2021_savvi suzana.pdf
Size: 10.99 MB
Format: Adobe Portable Document Format

Collections

Masters