Abstract

Short paragraphs that describe gene function, referred to as gene summaries, are valued by users of biological knowledgebases for the ease with which they convey key aspects of gene function. Manual curation of gene summaries, while desirable, is difficult for knowledgebases to sustain. We developed an algorithm that uses curated, structured gene data at the Alliance of Genome Resources (Alliance; www.alliancegenome.org) to automatically generate gene summaries that simulate natural language. The gene data used for this purpose include curated associations (annotations) to ontology terms from the Gene Ontology, Disease Ontology, model organism knowledgebase (MOK)-specific anatomy ontologies and Alliance orthology data. The method uses sentence templates for each data category included in the gene summary in order to build a natural language sentence from the list of terms associated with each gene. To improve readability of the summaries when numerous gene annotations are present, we developed a new algorithm that traverses ontology graphs in order to group terms by their common ancestors. The algorithm optimizes the coverage of the initial set of terms and limits the length of the final summary, using measures of information content of each ontology term as a criterion for inclusion in the summary. The automated gene summaries are generated with each Alliance release, ensuring that they reflect current data at the Alliance. Our method effectively leverages category-specific curation efforts of the Alliance member databases to create modular, structured and standardized gene summaries for seven member species of the Alliance. These automatically generated gene summaries make cross-species gene function comparisons tenable and increase discoverability of potential models of human disease. In addition to being displayed on Alliance gene pages, these summaries are also included on several MOK gene pages.
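The two ideas described above, grouping terms under common ancestors and filling a sentence template, can be illustrated with a small Python sketch. This is not the Alliance implementation: the toy ontology, the annotation counts, the greedy merge strategy, the `-log(frequency)` definition of information content and the "Exhibits ..." template are all assumptions made for illustration.

```python
import math
from itertools import combinations

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "kinase activity": ["catalytic activity"],
    "phosphatase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "DNA binding": ["binding"],
    "binding": ["molecular function"],
    "molecular function": [],
}

# Hypothetical annotation counts per term (each count includes descendants).
COUNTS = {
    "kinase activity": 10,
    "phosphatase activity": 5,
    "catalytic activity": 40,
    "DNA binding": 20,
    "binding": 50,
    "molecular function": 100,
}
TOTAL = COUNTS["molecular function"]


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """Assumed definition: rarer terms carry more information."""
    return -math.log(COUNTS[term] / TOTAL)


def trim_terms(terms, max_terms):
    """Greedily merge terms under their most informative common ancestor
    until at most max_terms remain (a simplification, not the paper's method)."""
    current = set(terms)
    while len(current) > max_terms:
        best = None
        for a, b in combinations(sorted(current), 2):
            shared = ancestors(a) & ancestors(b)
            if not shared:
                continue
            candidate = max(sorted(shared), key=information_content)
            if best is None or information_content(candidate) > information_content(best):
                best = candidate
        if best is None:  # no pair shares an ancestor; nothing left to merge
            break
        # Replace every term covered by the chosen ancestor with that ancestor.
        current = {t for t in current if best not in ancestors(t)} | {best}
    return sorted(current, key=lambda t: (-information_content(t), t))


def build_sentence(template, terms):
    """Fill a sentence template with a natural-language list of terms."""
    if not terms:
        return ""
    if len(terms) == 1:
        joined = terms[0]
    else:
        joined = ", ".join(terms[:-1]) + " and " + terms[-1]
    return template.format(terms=joined)


annotations = ["kinase activity", "phosphatase activity", "DNA binding"]
trimmed = trim_terms(annotations, max_terms=2)
print(build_sentence("Exhibits {terms}.", trimmed))
```

In this toy example the two enzyme terms are merged under their shared parent because it is more informative than the ontology root, while the unrelated binding term is kept as is. A stricter length limit forces merges higher in the graph at the cost of specificity.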

Highlights

  • Often, a key task for biological and biomedical knowledgebases is the summarization of the knowledge about a gene

  • We compared the gene summaries obtained by the different trimming algorithms, using data from Alliance v.2.3.0

  • We report the specificity of the final terms as a measure of both their depth in the ontology and their ICcorpus value


Introduction

A key task for biological and biomedical knowledgebases is the summarization of the knowledge about a gene. A textual gene summary is user-friendly, requiring no knowledge of controlled vocabularies such as ontologies, or of knowledgebase-specific data models. Such a summary is usually written as a brief text that condenses the current knowledge about the gene and arranges it into several data categories. These categories often include molecular function (MF) or activity, biological processes (BPs) in which the gene is involved, orthology/homology data and gene expression at the tissue and subcellular levels. Curators may add further gene information such as pathway data, genetic and physical interactions, drug interactions and regulation of gene expression and activity.
