Abstract

The unbiased and reproducible interpretation of high-content gene sets from large-scale genomic experiments is crucial to understanding biological themes, validating experimental data, and planning future experiments. Deriving biomedically relevant information from simple gene lists requires a mathematical association between those genes and scientific language, in the form of meaningful words or sentences. Unfortunately, existing software for deriving meaningful and easily appreciable scientific textual ‘tokens’ from large gene sets either relies on controlled vocabularies (Medical Subject Headings, Gene Ontology, BioCarta) or employs Boolean text searching and co-occurrence models that are incapable of detecting indirect links in the literature. As an improvement on existing web-based informatic tools, we have developed Textrous!, a web-based framework for the extraction of biomedical semantic meaning from an input gene set of arbitrary length. Textrous! employs natural language processing techniques, including latent semantic indexing (LSI), sentence splitting, word tokenization, part-of-speech tagging, and noun-phrase chunking, to mine MEDLINE abstracts, PubMed Central articles, articles from the Online Mendelian Inheritance in Man (OMIM), and Mammalian Phenotype annotation obtained from Jackson Laboratories. Textrous! can generate meaningful output even from very small input datasets, using two different text extraction methodologies (collective and individual) for selecting, ranking, clustering, and visualizing English words obtained from the user data. Textrous! therefore facilitates the output of quantitatively significant and easily appreciable semantic words and phrases linked to both individual gene and batch genomic data.
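To illustrate why latent semantic indexing can surface indirect links that a Boolean or co-occurrence search misses, the following is a minimal sketch on a toy corpus. It is not the Textrous! implementation: the gene names, the four "documents", the helper function names, and the choice of k = 2 latent dimensions are all hypothetical, and raw term counts stand in for whatever weighting a real system would apply.

```python
import numpy as np

# Hypothetical toy corpus: each "document" is a list of already-tokenized terms.
# geneA and glucose never appear in the same document; they are linked only
# indirectly, through insulin.
TERMS = ["geneA", "insulin", "glucose", "geneB", "apoptosis"]
DOCS = [
    ["geneA", "insulin"],
    ["insulin", "glucose"],
    ["geneB", "apoptosis"],
    ["apoptosis"],
]


def term_document_matrix(terms, docs):
    """Build a terms-by-documents matrix of raw occurrence counts."""
    index = {term: i for i, term in enumerate(terms)}
    matrix = np.zeros((len(terms), len(docs)))
    for j, doc in enumerate(docs):
        for token in doc:
            matrix[index[token], j] += 1.0
    return matrix


def lsi_term_vectors(matrix, k):
    """Project terms into a k-dimensional latent space via truncated SVD."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def rank_words(gene, terms, vectors):
    """Rank every other term by latent-space similarity to the given gene."""
    gene_vec = vectors[terms.index(gene)]
    scores = [
        (term, cosine(gene_vec, vectors[i]))
        for i, term in enumerate(terms)
        if term != gene
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


matrix = term_document_matrix(TERMS, DOCS)
vectors = lsi_term_vectors(matrix, k=2)
ranking = rank_words("geneA", TERMS, vectors)

# Direct co-occurrence between geneA and glucose (dot product of their rows).
direct_cooccurrence = float(matrix[0] @ matrix[2])

if __name__ == "__main__":
    print("direct co-occurrence geneA/glucose:", direct_cooccurrence)
    for term, score in ranking:
        print(f"{term:10s} {score:+.3f}")
```

In this toy setup the direct co-occurrence count between geneA and glucose is zero, yet the two terms end up close together in the latent space because both co-occur with insulin, while terms from the unrelated documents score near zero.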

Highlights

  • With the increasing prevalence of high-throughput genomic technologies, researchers are often challenged with selecting, analyzing, clustering, and interpreting lists of genes that are functionally relevant to the experiment at hand [1]

  • Bridging the gap between large gene sets and the English language is potentially valuable for a variety of applications, including the discovery of previously unknown biological connections, identification of potential research topics, visualization of biological themes, discrimination between specific data sets, and validation of existing data

  • Current software for the interpretation of high-throughput genomic data shares one or more of the following characteristics: reliance on controlled languages (Gene Ontology (GO), Medical Subject Headings (MeSH), BioCarta, Kyoto Encyclopedia of Genes and Genomes (KEGG)), inability to search more than a few genes, and use of standard Boolean and co-occurrence models [6,7,8,9]

Introduction

With the increasing prevalence of high-throughput genomic technologies, researchers are often challenged with selecting, analyzing, clustering, and interpreting lists of genes that are functionally relevant to the experiment at hand [1]. Given that an abundance of information about individual genes is contained in the text of published literature, literature mining with natural language processing techniques has become much more fruitful in recent years with the development of novel informatic procedures [2]. Current developments in this emerging field include literature-based methods for determining the functional coherence of a gene set, generating related transcription factors from microarray-derived gene sets, and the functional user-based clustering of related genes [3,4,5]. Similar tools that fall into the same generic category include the Database for Annotation, Visualization, and Integrated Discovery (DAVID), PubMatrix, WebGestalt, and Gene Set Enrichment Analysis [14,15,16,17]. All of these important and useful applications can create structured text interpretations of complex biological data, but they do so using rigid clustering criteria that may be considerably redundant or limited in scope.
