MScanner: a classifier for retrieving Medline citations

dc.contributor.author: Poulter, Graham [en_ZA]
dc.contributor.author: Rubin, Daniel [en_ZA]
dc.contributor.author: Altman, Russ [en_ZA]
dc.contributor.author: Seoighe, Cathal [en_ZA]
dc.date.accessioned: 2015-10-28T07:03:45Z
dc.date.available: 2015-10-28T07:03:45Z
dc.date.issued: 2008 [en_ZA]
dc.description.abstract [en_ZA]:
BACKGROUND: Keyword searching through PubMed and other systems is the standard means of retrieving information from Medline. However, ad-hoc retrieval systems do not meet all of the needs of databases that curate information from literature, or of text miners developing a corpus on a topic that has many terms indicative of relevance. Several databases have developed supervised learning methods that operate on a filtered subset of Medline, to classify Medline records so that fewer articles have to be manually reviewed for relevance. A few studies have considered generalisation of Medline classification to operate on the entire Medline database in a non-domain-specific manner, but existing applications lack speed, available implementations, or a means to measure performance in new domains.
RESULTS: MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs and returning results ranked by decreasing probability of relevance. For maximum speed it uses the Medical Subject Headings (MeSH) and journal of publication as a concise document representation, and takes roughly 90 seconds to return results against the 16 million records in Medline. The web interface provides interactive exploration of the results, and cross validated performance evaluation on the relevant input against a random subset of Medline. We describe the classifier implementation, cross validate it on three domain-specific topics, and compare its performance to that of an expert PubMed query for a complex topic. In cross validation on the three sample topics against 100,000 random articles, the classifier achieved excellent separation of relevant and irrelevant article score distributions, ROC areas between 0.97 and 0.99, and averaged precision between 0.69 and 0.92.
CONCLUSION: MScanner is an effective non-domain-specific classifier that operates on the entire Medline database, and is suited to retrieving topics for which many features may indicate relevance. Its web interface simplifies the task of classifying Medline citations, compared to building a pre-filter and classifier specific to the topic. The data sets and open source code used to obtain the results in this paper are available on-line and as supplementary material, and the web interface may be accessed at http://mscanner.stanford.edu.
dc.identifier.apacitation: Poulter, G., Rubin, D., Altman, R., & Seoighe, C. (2008). MScanner: a classifier for retrieving Medline citations. BMC Bioinformatics, http://hdl.handle.net/11427/14466 [en_ZA]
dc.identifier.chicagocitation: Poulter, Graham, Daniel Rubin, Russ Altman, and Cathal Seoighe. "MScanner: a classifier for retrieving Medline citations." BMC Bioinformatics (2008) http://hdl.handle.net/11427/14466 [en_ZA]
dc.identifier.citation: Poulter, G. L., Rubin, D. L., Altman, R. B., & Seoighe, C. (2008). MScanner: a classifier for retrieving Medline citations. BMC Bioinformatics, 9(1), 108. [en_ZA]
dc.identifier.ris [en_ZA]:
TY - Journal Article
AU - Poulter, Graham
AU - Rubin, Daniel
AU - Altman, Russ
AU - Seoighe, Cathal
AB - BACKGROUND: Keyword searching through PubMed and other systems is the standard means of retrieving information from Medline. However, ad-hoc retrieval systems do not meet all of the needs of databases that curate information from literature, or of text miners developing a corpus on a topic that has many terms indicative of relevance. Several databases have developed supervised learning methods that operate on a filtered subset of Medline, to classify Medline records so that fewer articles have to be manually reviewed for relevance. A few studies have considered generalisation of Medline classification to operate on the entire Medline database in a non-domain-specific manner, but existing applications lack speed, available implementations, or a means to measure performance in new domains. RESULTS: MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs and returning results ranked by decreasing probability of relevance. For maximum speed it uses the Medical Subject Headings (MeSH) and journal of publication as a concise document representation, and takes roughly 90 seconds to return results against the 16 million records in Medline. The web interface provides interactive exploration of the results, and cross validated performance evaluation on the relevant input against a random subset of Medline. We describe the classifier implementation, cross validate it on three domain-specific topics, and compare its performance to that of an expert PubMed query for a complex topic. In cross validation on the three sample topics against 100,000 random articles, the classifier achieved excellent separation of relevant and irrelevant article score distributions, ROC areas between 0.97 and 0.99, and averaged precision between 0.69 and 0.92. CONCLUSION: MScanner is an effective non-domain-specific classifier that operates on the entire Medline database, and is suited to retrieving topics for which many features may indicate relevance. Its web interface simplifies the task of classifying Medline citations, compared to building a pre-filter and classifier specific to the topic. The data sets and open source code used to obtain the results in this paper are available on-line and as supplementary material, and the web interface may be accessed at http://mscanner.stanford.edu.
DA - 2008
DB - OpenUCT
DO - 10.1186/1471-2105-9-108
DP - University of Cape Town
J1 - BMC Bioinformatics
LK - https://open.uct.ac.za
PB - University of Cape Town
PY - 2008
T1 - MScanner: a classifier for retrieving Medline citations
TI - MScanner: a classifier for retrieving Medline citations
UR - http://hdl.handle.net/11427/14466
ER -
dc.identifier.uri: http://hdl.handle.net/11427/14466
dc.identifier.uri: http://dx.doi.org/10.1186/1471-2105-9-108
dc.identifier.vancouvercitation: Poulter G, Rubin D, Altman R, Seoighe C. MScanner: a classifier for retrieving Medline citations. BMC Bioinformatics. 2008; http://hdl.handle.net/11427/14466. [en_ZA]
dc.language.iso: eng [en_ZA]
dc.publisher: BioMed Central Ltd [en_ZA]
dc.publisher.department: Department of Molecular and Cell Biology [en_ZA]
dc.publisher.faculty: Faculty of Science [en_ZA]
dc.publisher.institution: University of Cape Town
dc.rights: This is an Open Access article distributed under the terms of the Creative Commons Attribution License [en_ZA]
dc.rights.holder: 2008 Poulter et al; licensee BioMed Central Ltd. [en_ZA]
dc.rights.uri: http://creativecommons.org/licenses/by/2.0 [en_ZA]
dc.source: BMC Bioinformatics [en_ZA]
dc.source.uri: http://www.biomedcentral.com/bmcbioinformatics/ [en_ZA]
dc.subject.other: Bioinformatics [en_ZA]
dc.title: MScanner: a classifier for retrieving Medline citations [en_ZA]
dc.type: Journal Article [en_ZA]
uct.type.filetype: Text
uct.type.filetype: Image
uct.type.publication: Research [en_ZA]
uct.type.resource: Article [en_ZA]
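
Illustrative note: the abstract describes a Bayesian classifier that represents each citation by its MeSH terms and journal, and ranks results by decreasing probability of relevance. The following is a minimal sketch of that general idea, not MScanner's actual implementation. It assumes a simplified naive Bayes score that sums per-feature log-odds over the features present in a citation, with add-one smoothing; the function names, the smoothing choice, and the toy feature sets are all hypothetical.

```python
import math
from collections import Counter


def train_feature_scores(relevant_docs, background_docs):
    """Return the log-odds of each feature occurring in a relevant
    versus a background citation.

    Each document is a set of features (for example MeSH terms plus the
    journal name). Add-one smoothing is an assumption of this sketch.
    """
    rel_counts = Counter(f for doc in relevant_docs for f in doc)
    bg_counts = Counter(f for doc in background_docs for f in doc)
    n_rel, n_bg = len(relevant_docs), len(background_docs)
    scores = {}
    for feature in set(rel_counts) | set(bg_counts):
        p_rel = (rel_counts[feature] + 1) / (n_rel + 2)
        p_bg = (bg_counts[feature] + 1) / (n_bg + 2)
        scores[feature] = math.log(p_rel / p_bg)
    return scores


def score_document(doc, scores):
    """Sum the log-odds of the features present; unseen features add 0."""
    return sum(scores.get(f, 0.0) for f in doc)


def rank(docs_by_id, scores):
    """Return citation IDs sorted by decreasing score."""
    return sorted(
        docs_by_id,
        key=lambda doc_id: score_document(docs_by_id[doc_id], scores),
        reverse=True,
    )


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    relevant = [{"Pharmacogenetics", "Polymorphism, Genetic"}, {"Pharmacogenetics"}]
    background = [{"Surgery"}, {"Surgery", "Polymorphism, Genetic"}, {"Radiology"}]
    feature_scores = train_feature_scores(relevant, background)
    candidates = {"a": {"Pharmacogenetics"}, "b": {"Surgery"}, "c": {"Cardiology"}}
    print(rank(candidates, feature_scores))
```

Summing only the features that are present keeps scoring to a dictionary lookup per feature, which suits a compact representation such as MeSH terms plus journal; a full Bernoulli naive Bayes model would also account for absent features.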
Files
Original bundle
Name: Poulter_MScanner_retrieving_Medline_citations_2008.pdf
Size: 413.99 KB
Format: Adobe Portable Document Format