Abstract
This study proposes a text similarity model to support the biocuration efforts of the Conserved Domain Database (CDD). CDD is a curated resource that catalogs annotated multiple sequence alignment models for ancient domains and full-length proteins. These models allow for fast searching and quick identification of conserved motifs in protein sequences via Reverse PSI-BLAST. In addition, CDD curators prepare summaries detailing the function of these conserved domains and specific protein families, based on published peer-reviewed articles. To facilitate information access for database users, it is desirable to identify the specific referenced articles that support the assertions of curator-composed sentences. Moreover, CDD curators desire an alert system that scans newly published literature and proposes articles relevant to the existing CDD records. Our approach to these needs is a text similarity method that automatically maps a curator-written statement to candidate sentences extracted from the list of referenced articles, as well as from the articles in the PubMed Central database. To evaluate this proposal, we paired CDD description sentences with the top 10 matching sentences from the literature and gave these pairs to curators for review. Through this exercise, we found that we could map the articles in the reference list to the CDD description statements with an accuracy of 77%. In the dataset reviewed by curators, we successfully provided references for 86% of the curator statements. In addition, we suggested new articles for curator review, 50% of which the curators accepted for addition to the reference list. Through this process, we developed a substantial corpus of similar sentences from biomedical articles on protein sequence, structure and function research, which constitutes the CDD text similarity corpus.
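The retrieval step described above, pairing a curator statement with its top 10 matching literature sentences, can be illustrated with a minimal sketch. The abstract does not say which similarity function the study uses, so the TF-IDF cosine similarity below is an assumed stand-in, and the function names (`tokenize`, `tfidf_vectors`, `top_k_matches`) and the example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, so a term found in every document keeps a small weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_matches(statement, candidates, k=10):
    """Rank candidate sentences by similarity to the curator statement."""
    docs = [tokenize(statement)] + [tokenize(c) for c in candidates]
    vectors = tfidf_vectors(docs)
    query, rest = vectors[0], vectors[1:]
    scored = [(cosine(query, v), c) for v, c in zip(rest, candidates)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Illustrative sentences only; they are not taken from CDD or PubMed Central.
statement = "The kinase domain binds ATP and catalyzes phosphorylation."
candidates = [
    "ATP binding by the kinase domain is required for phosphorylation.",
    "The zinc finger motif recognizes DNA.",
    "Samples were centrifuged for ten minutes.",
]
results = top_k_matches(statement, candidates, k=10)
for score, sentence in results:
    print(f"{score:.3f}  {sentence}")
```

In a setting like the one described, the candidates would be sentences extracted from the referenced articles or from PubMed Central, and the ranked list would go to curators for review rather than being accepted automatically.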
This corpus contains 5159 sentence pairs, each judged for similarity on a scale from 1 (low) to 5 (high) and doubly annotated by four CDD curators. Curator-assigned similarity scores have a Pearson correlation coefficient of 0.70 and an inter-annotator agreement of 85%. To date, this is the largest biomedical text similarity resource that has been manually judged, evaluated and made publicly available to the community to foster research and development of text similarity algorithms.
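As a sketch of how the two reliability figures above could be computed for a doubly annotated corpus, the snippet below implements the Pearson correlation coefficient and a simple agreement rate. The abstract does not define how its 85% agreement was measured, so counting a pair as agreed when the two scores differ by at most a tolerance is an assumption, and the score lists are made-up illustrative values, not corpus data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def agreement_rate(x, y, tolerance=1):
    """Fraction of pairs whose two scores differ by at most `tolerance`.

    Assumed definition: the source does not state how agreement was computed.
    """
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty lists of equal length")
    agreed = sum(1 for a, b in zip(x, y) if abs(a - b) <= tolerance)
    return agreed / len(x)


# Made-up scores on the 1 (low) to 5 (high) scale, one per sentence pair.
annotator_a = [1, 2, 3, 4, 5]
annotator_b = [1, 3, 3, 5, 2]
print(f"Pearson r: {pearson(annotator_a, annotator_b):.3f}")
print(f"Agreement (within 1 point): {agreement_rate(annotator_a, annotator_b):.2f}")
```

Setting `tolerance=0` gives exact-match agreement instead; which definition is appropriate depends on how the annotation guidelines treat adjacent scores.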
Highlights
Text mining has been established as a necessary tool to help improve knowledge reusability through improved data access, representation and curation [1, 2].
Conserved Domain Database (CDD) content includes domain models curated by the CDD professional curators at NCBI, as well as models imported from external source databases.
We formulate the problem as a text similarity retrieval problem, and we describe our study with these specific contributions: 1) A method that maps the references attached to a given CDD record summary to the correct sentences related to them in the summary, for better information access; 2) A method that discovers new relevant PubMed articles for a given CDD record summary for curator review; and
Summary
Text mining has been established as a necessary tool to help improve knowledge reusability through improved data access, representation and curation [1, 2]. CDD content includes domain models curated by the CDD professional curators at NCBI, as well as models imported from external source databases. CDD-curated models use 3D-structure information to explicitly define domain boundaries and provide insights into sequence, structure and function relationships. These manually curated records are integrated within NCBI's search and retrieval system and are cross-linked with other databases such as Gene, 3D-structure, PubMed and PubChem. CDD at NCBI is a professionally annotated resource that catalogs multiple sequence alignment models for proteins, which are available as position-specific score matrices to allow for fast identification of conserved domains in protein sequences via Reverse PSI-BLAST. CDD curators are highly trained domain experts. They annotate functional sites, which can be mapped onto protein (query) sequences. Conserved sequence patterns have been recorded for 2123 of these site annotations, and their mapping onto query sequences is contingent on pattern matches.