Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span.

D M Blei, K Franks, M I Jordan, I S Mian

doi:10.1186/1471-2105-7-250

Abstract

Background: The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model which represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words.

Results: An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model, and to suggest additional candidates for age-related genes. A novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs enabled and facilitated the production of hypotheses about the function and role of clk-2.

Conclusion: Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation.
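The description of LDA as "a random mixture over latent topics", with each topic "a distribution over words", can be made concrete with a small simulation of the generative process. The following is a minimal sketch, not the authors' implementation: the two toy topics, the six-word vocabulary, the Dirichlet parameter values, and the function names are all assumptions chosen for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw topic proportions from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, topics, vocab, n_words, rng):
    """Generate one document: sample a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # document-level mixture over topics
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # latent topic for this word
        w = rng.choices(vocab, weights=topics[z])[0]           # word drawn from that topic
        words.append(w)
    return theta, words

# Hypothetical toy vocabulary and topics (illustrative values, not estimated from the CGC corpus).
VOCAB = ["lifespan", "insulin", "aging", "embryo", "cell", "division"]
TOPICS = [
    [0.40, 0.30, 0.25, 0.02, 0.02, 0.01],  # an "aging"-like topic
    [0.02, 0.02, 0.01, 0.35, 0.30, 0.30],  # a "development"-like topic
]

rng = random.Random(0)
theta, words = generate_document([0.5, 0.5], TOPICS, VOCAB, n_words=8, rng=rng)
print("topic proportions:", theta)
print("generated words:", words)
```

Fitting an LDA model reverses this process: given only the observed words, it estimates the topics and the per-document topic proportions.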

Highlights

The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data.
Each Caenorhabditis Genetic Center (CGC) Bibliography free-text item was transformed into a bag of words, yielding a corpus of M = 5,225 documents and a vocabulary of V = 28,971 words.
Since the perplexities of the 50- and 100-topic Latent Dirichlet Allocation (LDA) models are low and similar, a latent space with 50 topics appears to provide a parsimonious description of the CGC corpus.
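To illustrate the two quantities mentioned in the highlights, the sketch below builds bag-of-words counts and computes held-out perplexity, exp(-(total log-likelihood) / (total word count)), where lower values indicate better predictive performance. For brevity it scores a smoothed unigram model (one of the baselines named in the abstract) rather than LDA. The whitespace tokenizer, the add-one smoothing, the example sentences, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Reduce a free-text item to unordered word counts (naive whitespace tokenizer)."""
    return Counter(text.lower().split())

def fit_unigram(train_bags, vocab, smoothing=1.0):
    """Estimate a single word distribution shared by all documents, with additive smoothing."""
    counts = Counter()
    for bag in train_bags:
        counts.update(bag)
    total = sum(counts[w] for w in vocab) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}

def perplexity(model, test_bags):
    """Held-out perplexity; words outside the model vocabulary are skipped."""
    log_lik, n_words = 0.0, 0
    for bag in test_bags:
        for word, count in bag.items():
            if word in model:
                log_lik += count * math.log(model[word])
                n_words += count
    if n_words == 0:
        raise ValueError("no in-vocabulary words in the test documents")
    return math.exp(-log_lik / n_words)

# Hypothetical miniature corpus (not taken from the CGC Bibliography).
train = [bag_of_words("clk-2 mutants show extended life span"),
         bag_of_words("daf-2 regulates life span and dauer formation")]
test = [bag_of_words("life span of clk-2 mutants")]

vocab = sorted(set().union(*train))
model = fit_unigram(train, vocab)
print("vocabulary size V:", len(vocab))
print("held-out perplexity:", perplexity(model, test))
```

A model that spreads probability uniformly over V words has perplexity exactly V, which gives a reference point when comparing models with different numbers of topics.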


Journal: BMC Bioinformatics
Publication Date: May 8, 2006
Citations: 63
License type: cc-by

Similar Papers

A spatial class LDA model for classification of sports scene images
Jin Jeon ... Munchurl Kim
01 Sep 2015

Tourism Activity Recognition and Discovery Based on Improved LDA Model
Yifan Yuan ... Jangmyung Lee
01 Jan 2015

Annotating Web document in multi-granularity way by-statistical topical model
Liu Yuan ... Long-Bo Zhang
Journal of Computer Applications | VOL. 30
07 Jan 2011

Research on Multi-document Summarization Based on LDA Topic Model
Jinqiang Bian ... Qian Chen
01 Aug 2014