Abstract

Background: Annotation of a set of genes is often accomplished through comparison to a library of labelled gene sets such as biological processes or canonical pathways. However, this approach might fail if the employed libraries are not up to date with the latest research, don't capture relevant biological themes or are curated at a different level of granularity than is required to appropriately analyze the input gene set. At the same time, the vast biomedical literature offers an unstructured repository of the latest research findings that can be tapped to provide thematic sub-groupings for any input gene set.

Methods: Our proposed method relies on a gene-specific text corpus and extracts commonalities between documents in an unsupervised manner using a topic model approach. We automatically determine the number of topics summarizing the corpus and calculate a gene relevancy score for each topic allowing us to eliminate non-specific topics. As a result we obtain a set of literature topics in which each topic is associated with a subset of the input genes providing directly interpretable keywords and corresponding documents for literature research.

Results: We validate our method based on labelled gene sets from the KEGG metabolic pathway collection and the genetic association database (GAD) and show that the approach is able to detect topics consistent with the labelled annotation. Furthermore, we discuss the results on three different types of experimentally derived gene sets, (1) differentially expressed genes from a cardiac hypertrophy experiment in mice, (2) altered transcript abundance in human pancreatic beta cells, and (3) genes implicated by GWA studies to be associated with metabolite levels in a healthy population. In all three cases, we are able to replicate findings from the original papers in a quick and semi-automated manner.

Conclusions: Our approach provides a novel way of automatically generating meaningful annotations for gene sets that are directly tied to relevant articles in the literature. Extending a general topic model method, the approach introduced here establishes a workflow for the interpretation of gene sets generated from diverse experimental scenarios that can complement the classical approach of comparison to reference gene sets.

Highlights

  • Annotation of a set of genes is often accomplished through comparison to a library of labelled gene sets such as biological processes or canonical pathways

  • The vast biomedical literature offers an unstructured repository of the latest research findings that can be tapped to provide thematic sub-groupings for the gene set under consideration

  • The Latent Semantic Analysis (LSA) approach was later extended to Probabilistic Latent Semantic Analysis (PLSA), which models each word in a document as a sample from a mixture model [3]


Summary

Introduction

Annotation of a set of genes is often accomplished through comparison to a library of labelled gene sets such as biological processes or canonical pathways. This approach might fail if the employed libraries are not up to date with the latest research, don’t capture relevant biological themes or are curated at a different level of granularity than is required to appropriately analyze the input gene set. The vast biomedical literature offers an unstructured repository of the latest research findings that can be tapped to provide thematic sub-groupings for any input gene set. Probabilistic Latent Semantic Analysis (PLSA) represented a more direct approach to model the data than Latent Semantic Analysis (LSA), but its lack of a probabilistic model at the document level led to the development of Latent Dirichlet Allocation (LDA) [4].
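
The distinction between the two models can be written down compactly. The block below is a sketch in standard textbook notation, not notation taken from this paper: the symbols K (number of topics), z (topic assignment), θ_d (topic proportions of document d), β_k (word distribution of topic k) and α (Dirichlet prior) are assumptions introduced here for illustration. In PLSA the per-document topic weights p(z | d) are free parameters fitted separately for each training document, whereas LDA treats them as a draw from a shared Dirichlet prior, which is what supplies the document-level probabilistic model.

```latex
% PLSA: each word w in document d is drawn from a mixture over K topics.
% The mixing weights p(z = k | d) are plain parameters, one set per document.
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)

% LDA: the mixing weights become a random variable with a Dirichlet prior,
% giving a generative model for whole documents (n indexes word positions).
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
z_{d,n} \mid \theta_d \sim \mathrm{Categorical}(\theta_d), \qquad
w_{d,n} \mid z_{d,n} \sim \mathrm{Categorical}(\beta_{z_{d,n}})
```

Under these assumptions, a previously unseen document can be assigned topic proportions by inferring θ_d under the prior, rather than by fitting new per-document parameters as PLSA would require.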

