Abstract
Background: Identifying relevant studies for inclusion in a systematic review (i.e. screening) is a complex, laborious and expensive task. Recently, a number of studies have shown that using machine learning and text mining methods to automatically identify relevant studies has the potential to drastically decrease the workload involved in the screening phase. The vast majority of these machine learning methods exploit the same underlying principle, i.e. a study is modelled as a bag-of-words (BOW).
Methods: We explore the use of topic modelling methods to derive a more informative representation of studies. We apply Latent Dirichlet allocation (LDA), an unsupervised topic modelling approach, to automatically identify topics in a collection of studies. We then represent each study as a distribution over LDA topics. Additionally, we enrich the topics derived using LDA with multi-word terms identified by an automatic term recognition (ATR) tool. For evaluation purposes, we carry out automatic identification of relevant studies using support vector machine (SVM)-based classifiers that employ both our novel topic-based representation and the BOW representation.
Results: Our results show that the SVM classifier is able to identify a greater number of relevant studies when using the LDA representation than when using the BOW representation. These observations hold for two systematic reviews of the clinical domain and three reviews of the social science domain.
Conclusions: A topic-based feature representation of documents outperforms the BOW representation when applied to the task of automatic citation screening. The proposed term-enriched topics are more informative and less ambiguous to systematic reviewers.
Electronic supplementary material: The online version of this article (doi:10.1186/s13643-015-0117-0) contains supplementary material, which is available to authorized users.
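The difference between the BOW representation and the topic-based representation described in the Methods can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the abstract does not say which LDA software or inference method was used, so the collapsed Gibbs sampler below is a swapped-in, textbook-style choice, and the function names (`bow_vector`, `lda_topic_distributions`), the hyperparameters (`alpha`, `beta`, `n_iter`) and the four toy documents are all assumptions made for illustration. The ATR term enrichment and the SVM classifier are left out; in a full pipeline, either set of vectors would be passed to a classifier as features.

```python
import random
from collections import Counter


def bow_vector(tokens, vocab):
    """Bag-of-words: one raw count per vocabulary word (hypothetical helper)."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def lda_topic_distributions(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA via collapsed Gibbs sampling.

    Returns one smoothed topic distribution per document. The inference
    method and all hyperparameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word w is in topic k
    topic_total = [0] * n_topics                           # tokens assigned to topic k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wid = assignments[d][i], word_id[w]
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


# Invented toy "abstracts", already tokenised; not data from the article.
docs = [
    "statin therapy reduces cholesterol in adult patients".split(),
    "cholesterol levels in patients after statin therapy".split(),
    "youth employment programmes and school outcomes".split(),
    "school programmes improve youth outcomes".split(),
]
vocab = sorted({w for doc in docs for w in doc})
bow = [bow_vector(doc, vocab) for doc in docs]     # one dimension per word
theta = lda_topic_distributions(docs, n_topics=2)  # one dimension per topic
```

The BOW vectors grow with the vocabulary, whereas the topic vectors have a fixed, small dimensionality equal to the chosen number of topics. Which topics the sampler finds on such a tiny corpus depends on the seed and hyperparameters, so no particular output is claimed here.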
Highlights
Identifying relevant studies for inclusion in a systematic review (i.e. screening) is a complex, laborious and expensive task.
The datasets were used as the basis for the intrinsic evaluation of the different text classification methods.
Summary
Identifying relevant studies for inclusion in a systematic review (i.e. screening) is a complex, laborious and expensive task. The screening phase of a systematic review aims to identify citations relevant to a research topic, according to a pre-defined protocol [1,2,3,4] known as the Population, Intervention, Comparator and Outcome (PICO) framework. The number of relevant citations is usually significantly lower than the number of irrelevant ones, which means that reviewers have to deal with an extremely imbalanced dataset. To overcome these limitations, methods such as machine learning, text mining [9, 10], text classification [11] and active learning [6, 12] have been used to partially automate this process, in order to reduce the workload without sacrificing the quality of the reviews.