Abstract

In model organism databases, an important task is converting free text from the biomedical literature into a structured data format. Curators at the Rat Genome Database (RGD), the primary source of rat genomic, genetic, and physiological data, spend considerable time and effort curating functional information for genes, QTLs, and strains from the literature. OntoMate was developed at RGD to increase curation efficiency and to prioritize literature for data extraction. The tool tags PubMed abstracts with genes, gene names, gene mutations, organism names, and terms from 16 ontologies/vocabularies, including synonyms and aliases, that are used to represent functional information. In this project, we used an unsupervised tagging method to reduce the human effort needed to create training data. In this approach, we developed a machine learning tool based on decision tree classification techniques. Mentions that belong uniquely to one semantic type serve as positive samples, and those whose semantic types fall outside the desired group are assumed to be negative samples. An interface allows the user to build a complex query incorporating terms from any of the ontologies, gene symbols, organisms, dates, and other parameters. The query returns abstracts together with all tagged parameters specified in the query, including children of the chosen ontology terms. The user can filter the results further through a panel that lists organisms, genes, and diseases with the number of papers returned for each. Abstracts and papers are ranked by relevance to the query. The tool is fully integrated into the curation software, so citations and abstracts can be entered automatically into the RGD database and assigned IDs, and the genes and ontology terms in the tags can be checked to create annotations linked to the paper. The system was built with a scalable, open architecture, and the literature is updated daily. The tool uses Solr indexing technology and categorizes papers based on a relevance score. It indexes and tags more than 27 million abstracts. With the use of bioNLP tools, RGD has added more automation to its curation workflow.
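
The sample-selection rule described above (mentions unique to one semantic type are positives, mentions with only other semantic types are negatives) can be illustrated with a minimal Python sketch. This is not the RGD implementation: the function name `label_mentions`, the toy lexicon, and the choice to leave mentions that belong to both the target type and another type unlabeled are all assumptions made for illustration.

```python
def label_mentions(lexicon, target_type):
    """Split mentions into positive, negative, and ambiguous groups.

    lexicon: dict mapping a mention string to the set of semantic
             types it can denote (hypothetical structure).
    target_type: the semantic type the classifier is trained for.
    """
    positives, negatives, ambiguous = [], [], []
    for mention, types in lexicon.items():
        if types == {target_type}:
            # Belongs uniquely to the target type: positive sample.
            positives.append(mention)
        elif target_type not in types:
            # Only other semantic types: assumed negative sample.
            negatives.append(mention)
        else:
            # Target type plus others: left unlabeled (an assumption).
            ambiguous.append(mention)
    return sorted(positives), sorted(negatives), sorted(ambiguous)


# Toy lexicon, invented for illustration only.
toy_lexicon = {
    "Brca1": {"gene"},
    "hypertension": {"disease"},
    "insulin": {"gene", "chemical"},
}

pos, neg, amb = label_mentions(toy_lexicon, "gene")
print(pos, neg, amb)
```

The labeled groups produced this way could then be passed to a decision tree classifier as training data without manual annotation, which is the effort the approach aims to save.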
