Database Curation Research Articles

The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient’s genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting disease-gene-variant triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of these triplets. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer’s disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors. Across all diseases, our approach returned 272 triplets (disease-gene-variant) that overlapped with entries in UniProt and 5,384 triplets without overlap in UniProt. Analysis of the overlapping triplets and of a stratified sample of the non-overlapping triplets revealed accuracies of 93% and 80% for the respective categories (cumulative accuracy, 77%). We conclude that our process represents an important and broadly applicable improvement to the state of the art for curation of disease-gene-variant relationships.
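The split of extracted triplets into those that overlap with UniProt and those that do not can be pictured as a set comparison. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the names `Triplet` and `split_by_overlap` and all data values are hypothetical placeholders, and real matching would first require normalizing disease, gene, and variant names.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """A disease-gene-variant triplet (hypothetical representation)."""
    disease: str
    gene: str
    variant: str


def split_by_overlap(extracted, curated):
    """Partition extracted triplets by whether they also appear in the curated set."""
    extracted, curated = set(extracted), set(curated)
    return extracted & curated, extracted - curated


# Placeholder data for illustration only; not taken from the paper or UniProt.
extracted = [
    Triplet("disease A", "GENE1", "variant-1"),
    Triplet("disease A", "GENE1", "variant-2"),
    Triplet("disease B", "GENE2", "variant-3"),
]
curated = [
    Triplet("disease A", "GENE1", "variant-1"),
    Triplet("disease B", "GENE2", "variant-3"),
]

overlap, non_overlap = split_by_overlap(extracted, curated)
print(len(overlap), len(non_overlap))
```

Exact tuple equality is the simplest possible matching rule; it is used here only to show the partition, since non-overlapping triplets still need manual review to decide whether they are errors or valid findings absent from the curated database.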

Background: MicroRNAs are increasingly being appreciated as critical players in human diseases, and questions concerning the role of microRNAs arise in many areas of biomedical research. There are several manually curated databases of microRNA-disease associations gathered from the biomedical literature; however, it is difficult for curators of these databases to keep up with the explosion of publications in the microRNA-disease field. Moreover, automated literature mining tools that assist manual curation of microRNA-disease associations currently capture only one microRNA property (expression) in the context of one disease (cancer). Thus, there is a clear need to develop more sophisticated automated literature mining tools that capture a variety of microRNA properties and relations in the context of multiple diseases to provide researchers with fast access to the most recent published information and to streamline and accelerate manual curation.

Methods: We have developed miRiaD (microRNAs in association with Disease), a text-mining tool that automatically extracts associations between microRNAs and diseases from the literature. These associations are often not directly linked, and the intermediate relations are often highly informative for the biomedical researcher. Thus, miRiaD extracts the miR-disease pairs together with an explanation for their association. We also developed a procedure that assigns scores to sentences, marking their informativeness, based on the microRNA-disease relation observed within the sentence.

Results: miRiaD was applied to the entire Medline corpus, identifying 8301 PMIDs with miR-disease associations. These abstracts and the miR-disease associations are available for browsing at http://biotm.cis.udel.edu/miRiaD. We evaluated the recall and precision of miRiaD with respect to information of high interest to public microRNA-disease database curators (expression and target gene associations), obtaining a recall of 88.46–90.78. When we expanded the evaluation to include sentences with a wide range of microRNA-disease information that may be of interest to biomedical researchers, miRiaD also performed very well with an F-score of 89.4. The informativeness ranking of sentences was evaluated in terms of nDCG (0.977) and correlation metrics (0.678–0.727) when compared to an annotator’s ranked list.

Conclusions: miRiaD, a high-performance system that can capture a wide variety of microRNA-disease related information, extends beyond the scope of existing microRNA-disease resources. It can be incorporated into manual curation pipelines and serve as a resource for biomedical researchers interested in the role of microRNAs in disease. In our ongoing work we are developing an improved miRiaD web interface that will facilitate complex queries about microRNA-disease relationships, such as “In what diseases does microRNA regulation of apoptosis play a role?” or “Is there overlap in the sets of genes targeted by microRNAs in different types of dementia?”

Electronic supplementary material: The online version of this article (doi:10.1186/s13326-015-0044-y) contains supplementary material, which is available to authorized users.
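The nDCG value reported for the informativeness ranking compares a system's sentence ordering against an ideal ordering. The sketch below shows one common formulation (linear gain with a log2 position discount); the abstract does not say which variant the authors used, so the formula choice, the function names `dcg` and `ndcg`, and the grade values are all assumptions made for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each grade is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given order divided by the DCG of the ideal (descending) order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical informativeness grades, listed in the order a system ranked the sentences.
system_order = [3, 2, 3, 0, 1]
print(round(ndcg(system_order), 3))
```

A ranking that already lists the grades in descending order scores exactly 1.0, and misplacing highly informative sentences lower in the list reduces the score.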

Articles published on Database Curation

Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)

A text mining-based approach to graph database curation in support of metabolic pathway reconstruction

The Israeli National Genetic database: a 10-year experience

Rapid development of entity-based data models for bioinformatics with persistence object-oriented design and structured interfaces

SciLite: a platform for displaying text-mined annotations as a means to link research articles with biological data

Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine.

Recognition of side effects as implicit-opinion words in drug reviews

Text Mining to Support Gene Ontology Curation and Vice Versa.

Filtering large-scale event collections using a combination of supervised and unsupervised learning for event trigger classification.

MiRiaD: A Text Mining Tool for Detecting Associations of microRNAs with Diseases.

Text mining for precision medicine: automating disease-mutation relationship extraction from biomedical literature.

Intelligent Retrieval for Biodiversity

Discovering biomedical semantic relations in PubMed queries for information retrieval and database curation.

NTTMUNSW BioC modules for recognizing and normalizing species and gene/protein mentions

BioCreative V CDR task corpus: a resource for chemical disease relation extraction

NeXtA5: accelerating annotation of articles via automated approaches in neXtProt.

Crowd-sourcing and author submission as alternatives to professional curation.

Text Mining for Precision Medicine: Bringing Structure to EHRs and Biomedical Literature to Understand Genes and Health.

Ambiguity of non-systematic chemical identifiers within and between small-molecule databases.
