Abstract

Biological knowledgebases rely on expert biocuration of the research literature to maintain up-to-date collections of data organized in machine-readable form. To enter information into knowledgebases, curators follow three steps: (i) identify papers containing relevant data, a process called triaging; (ii) recognize named entities; and (iii) extract and curate data in accordance with the underlying data models. WormBase (WB), the authoritative repository for research data on Caenorhabditis elegans and other nematodes, uses text mining (TM) to semi-automate its curation pipeline. In addition, WB engages its community, via an Author First Pass (AFP) system, to help recognize entities and classify data types in their recently published papers. In this paper, we present a new WB AFP system that combines TM and AFP into a single application to enhance community curation. The system employs string-searching algorithms and statistical methods, e.g. support vector machines (SVMs), to extract biological entities and classify data types, and it presents the results to authors in a web form where they validate the extracted information rather than enter it de novo, as the previous form required. With this new system, we lessen the burden on authors while at the same time receiving valuable feedback on the performance of our TM tools. The new user interface also links out to specific structured data submission forms, e.g. for phenotype or expression pattern data, giving authors the opportunity to contribute more detailed curation that can be incorporated into WB with minimal curator review. Our approach is generalizable and could be applied to other knowledgebases that would like to engage their user community in assisting with curation. In the five months following the launch of the new system, the response rate has been comparable with that of the previous AFP version, but the quality and quantity of the data received have greatly improved.

Highlights

  • Biological knowledgebases are essential everyday tools for researchers, as they provide a point of access to multiple types of genetic and genomic data extracted from the biomedical literature

  • The new WB Author First Pass (AFP) system is organized into three main components (Figure 1): (i) the backend software that periodically retrieves papers from the WB internal PDF repository, converts the PDF to text, extracts and classifies relevant information and sends notifications to corresponding authors; (ii) the AFP form, which presents information extracted from papers to the authors through a web-based user interface; and (iii) the AFP curator dashboard, which allows WB curators to compare extracted and submitted

  • To evaluate the advantages of the new AFP system, and to monitor the quantity and quality of the data submitted by the authors, we compared the data processed by the old and new AFP systems, including the authors’ feedback

Summary

Introduction

Biological knowledgebases are essential everyday tools for researchers, as they provide a point of access to multiple types of genetic and genomic data extracted from the biomedical literature. In ‘community curation’, authors are engaged to curate at least some of the data in their papers using web interfaces designed to facilitate data entry. The AFP pipeline is the backend of the AFP system and contains its core TM functions. It is executed every week, and each run processes up to 50 newly published ‘primary’ articles (as defined above) obtained from the WB PDF repository. Papers for which the PDF-to-text conversion module failed (6.5%) are excluded from the pipeline, while all others are passed through its four steps: (i) binary data type classification, (ii) entity list extraction, (iii) email address extraction and (iv) author notification.
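
The four pipeline steps described above can be illustrated with a minimal Python sketch. This is not the WB implementation: every name here (`Paper`, `Extraction`, `run_pipeline`, the `notify` callback, the keyword table) is hypothetical, and a simple keyword check stands in for the SVM classifiers, which this summary does not describe in detail. Only the cap of 50 papers per run, the exclusion of papers whose text conversion failed, and the order of the four steps are taken from the text.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set

MAX_PAPERS_PER_RUN = 50  # from the text: up to 50 papers per weekly run

# Deliberately simple email pattern; an assumption, not the WB extractor.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass
class Paper:
    paper_id: str
    text: Optional[str]  # None when PDF-to-text conversion failed


@dataclass
class Extraction:
    paper_id: str
    data_types: Dict[str, bool]
    entities: List[str]
    emails: List[str]


def classify_data_types(text: str, keywords: Dict[str, List[str]]) -> Dict[str, bool]:
    """Step (i): one binary flag per data type.

    Keyword presence is a stand-in for the statistical classifiers.
    """
    lowered = text.lower()
    return {dtype: any(k in lowered for k in kws) for dtype, kws in keywords.items()}


def extract_entities(text: str, known_entities: Set[str]) -> List[str]:
    """Step (ii): whole-word string search against a list of known names."""
    return sorted(
        e for e in known_entities
        if re.search(r"\b" + re.escape(e) + r"\b", text)
    )


def extract_emails(text: str) -> List[str]:
    """Step (iii): collect candidate author email addresses."""
    return sorted(set(EMAIL_RE.findall(text)))


def run_pipeline(
    papers: List[Paper],
    known_entities: Set[str],
    keywords: Dict[str, List[str]],
    notify: Callable[[str, str], None],
) -> List[Extraction]:
    """Process one batch: skip failed conversions, then run steps (i) to (iv)."""
    results = []
    for paper in papers[:MAX_PAPERS_PER_RUN]:
        if paper.text is None:
            continue  # conversion failed: excluded from the pipeline
        emails = extract_emails(paper.text)
        results.append(Extraction(
            paper_id=paper.paper_id,
            data_types=classify_data_types(paper.text, keywords),
            entities=extract_entities(paper.text, known_entities),
            emails=emails,
        ))
        for address in emails:
            notify(address, paper.paper_id)  # step (iv): author notification
    return results
```

In this sketch the notification step is passed in as a callback so that the extraction logic can be exercised without sending email; how the real system schedules the weekly run and delivers notifications is not specified here.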
