Abstract

Background: Synthesis of data from published human genetic association studies is a critical step in the translation of human genome discoveries into health applications. Although genetic association studies account for a substantial proportion of the abstracts in PubMed, identifying them with standard queries is not always accurate or efficient. Further automating the literature-screening process can reduce the burden of a labor-intensive and time-consuming traditional literature search. The Support Vector Machine (SVM), a well-established machine learning technique, has been successful in classifying text, including biomedical literature. The GAPscreener, a free SVM-based software tool, can be used to assist in screening PubMed abstracts for human genetic association studies.

Results: The data source for this research was the HuGE Navigator, formerly known as the HuGE Pub Lit database. Weighted SVM feature selection based on a keyword list obtained by the two-way z score method demonstrated the best screening performance, achieving 97.5% recall, 98.3% specificity and 31.9% precision in performance testing. Compared with the traditional screening process based on a complex PubMed query, the SVM tool reduced the number of abstracts requiring individual review by the database curator by about 90%. The tool also ascertained 47 articles that were missed by the traditional literature screening process during the 4-week test period. We examined the literature on genetic associations with preterm birth as an example. Compared with the traditional, manual process, the GAPscreener both reduced effort and improved accuracy.

Conclusion: GAPscreener is the first free SVM-based application available for screening the human genetic association literature in PubMed with high recall and specificity. The user-friendly graphical user interface makes this a practical, stand-alone application. The software can be downloaded at no charge.
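The three screening metrics reported above can be related with a short worked example. The sketch below is illustrative only: the confusion-matrix counts are hypothetical, chosen so the resulting rates fall near the reported ones, and are not the study's actual counts. It shows how high recall and specificity can coexist with modest precision when relevant abstracts are a small fraction of everything screened.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing classifier output with curator decisions."""
    tp: int  # relevant abstracts flagged as relevant
    fp: int  # irrelevant abstracts flagged as relevant
    tn: int  # irrelevant abstracts correctly rejected
    fn: int  # relevant abstracts missed


def recall(c: ConfusionCounts) -> float:
    """Fraction of truly relevant abstracts that were flagged."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of irrelevant abstracts that were correctly rejected."""
    return c.tn / (c.tn + c.fp)


def precision(c: ConfusionCounts) -> float:
    """Fraction of flagged abstracts that are truly relevant."""
    return c.tp / (c.tp + c.fp)


# HYPOTHETICAL counts (not from the paper): 10,000 abstracts screened,
# of which only 80 are genetic association studies.
example = ConfusionCounts(tp=78, fp=167, tn=9753, fn=2)

if __name__ == "__main__":
    print(f"recall      = {recall(example):.3f}")
    print(f"specificity = {specificity(example):.3f}")
    print(f"precision   = {precision(example):.3f}")
```

In this hypothetical setting the false positives are a small share of the irrelevant abstracts, yet they still outnumber the true positives, which is why precision stays low even though the curator reviews far fewer abstracts overall.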

Highlights

  • Synthesis of data from published human genetic association studies is a critical step in the translation of human genome discoveries into health applications

  • We report a novel method for feature selection and show that using it to train the Support Vector Machine (SVM) model significantly improved its ability to classify reports of human genetic association studies

  • PubMed abstract text retrieval: We developed a PubMed text extraction tool using the NCBI E-utility [20] to retrieve text content based on PubMed identification numbers (PMIDs)
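The retrieval step in the last highlight can be sketched as follows. This is a minimal sketch, not the authors' tool: the source says only that an NCBI E-utility was used, so the choice of the EFetch endpoint, the `rettype`/`retmode` values, and the function names `build_efetch_url` and `fetch_abstract_text` are assumptions made for illustration.

```python
from typing import Iterable
from urllib.parse import urlencode
from urllib.request import urlopen

# Public NCBI E-utilities EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids: Iterable[int]) -> str:
    """Build an EFetch request URL for plain-text PubMed abstracts.

    Hypothetical helper: PMIDs are validated as integers and joined
    into the comma-separated `id` parameter.
    """
    ids = [str(int(p)) for p in pmids]
    if not ids:
        raise ValueError("at least one PMID is required")
    params = {
        "db": "pubmed",
        "id": ",".join(ids),
        "rettype": "abstract",  # assumption: abstract-level records
        "retmode": "text",      # assumption: plain text, not XML
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_abstract_text(pmids: Iterable[int], timeout: float = 30.0) -> str:
    """Download abstract text for the given PMIDs (requires network access)."""
    with urlopen(build_efetch_url(pmids), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Separating URL construction from the network call keeps the request format easy to inspect without contacting NCBI; a real screening pipeline would also need to batch large PMID lists and respect NCBI's request-rate guidance.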


Summary

Introduction

Synthesis of data from published human genetic association studies is a critical step in the translation of human genome discoveries into health applications. Further automating the literature-screening process can reduce the burden of a labor-intensive and time-consuming traditional literature search. The peer-reviewed scientific literature is a major source of information for developing research hypotheses and creating new knowledge through synthesis of research findings [1]. Human genetic association studies epitomize this challenge because they have proliferated rapidly since completion of the Human Genome Project [2]. Many databases that store genetic association information, including citations from PubMed, have been built and curated [5,6,7]. PubMed [8] is the largest publicly accessible biomedical literature database and is the main source for such activities. The necessarily labor-intensive screening and curation process makes the maintenance of such databases extremely challenging.
