Abstract

We present a passage relevance model that integrates syntactic and semantic evidence of biomedical concepts and topics using a probabilistic graphical model. Component models of topics, concepts, terms, and documents are represented as potential functions within a Markov Random Field. The probability that a passage is relevant to a biologist's information need is represented as the joint distribution across all potential functions. Relevance model feedback from top-ranked passages is used to improve distributional estimates of query concepts and topics in context, and a dimensional indexing strategy is used for efficient aggregation of concept and term statistics. By integrating multiple sources of evidence, including dependencies between topics, concepts, and terms, we seek to improve the precision of genomics literature passage retrieval. Using this model, we demonstrate statistically significant improvements in retrieval precision on a large genomics literature corpus.
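To make the idea of a joint distribution over potential functions concrete, the following is a minimal sketch, not the authors' implementation. It assumes a log-linear form in which a passage's score is a weighted sum of log-potentials for terms, concepts, and topics, which ranks passages the same way as the unnormalized joint probability. The function names, the smoothing constant `eps`, the equal weights, and the toy data are all hypothetical choices made for illustration.

```python
import math


def smoothed_log_fraction(hits, total, eps=0.01):
    # Additive smoothing keeps the potential strictly positive so the log is defined.
    return math.log((hits + eps) / (total + eps))


def term_potential(tokens, query_terms):
    # Log-potential for term evidence: share of query terms found in the passage.
    present = set(tokens)
    hits = sum(1 for t in query_terms if t in present)
    return smoothed_log_fraction(hits, len(query_terms))


def concept_potential(tokens, concept_variants):
    # A concept counts as present if any of its surface variants occurs.
    present = set(tokens)
    hits = sum(1 for variants in concept_variants
               if any(v in present for v in variants))
    return smoothed_log_fraction(hits, len(concept_variants))


def topic_potential(tokens, topic_terms):
    # Log-potential for topic evidence: share of passage tokens in the topic vocabulary.
    hits = sum(1 for t in tokens if t in topic_terms)
    return smoothed_log_fraction(hits, len(tokens))


def passage_score(tokens, query_terms, concept_variants, topic_terms,
                  weights=(1.0, 1.0, 1.0)):
    # Weighted sum of log-potentials; equal to the log of the unnormalized
    # joint probability under the assumed log-linear form.
    w_term, w_concept, w_topic = weights
    return (w_term * term_potential(tokens, query_terms)
            + w_concept * concept_potential(tokens, concept_variants)
            + w_topic * topic_potential(tokens, topic_terms))


# Toy example (hypothetical data): a query about "IP" in a protein-assay context.
query_terms = ["ip", "protein"]
concept_variants = [["ip", "immunoprecipitation"]]
topic_terms = {"antibody", "protein", "lysate"}

passage_a = "ip of the protein from cell lysate with antibody".split()
passage_b = "ip reduced infarct size after ischemia".split()

score_a = passage_score(passage_a, query_terms, concept_variants, topic_terms)
score_b = passage_score(passage_b, query_terms, concept_variants, topic_terms)
print(score_a > score_b)  # the passage with supporting topic context ranks higher
```

Both toy passages contain the acronym "ip", so term matching alone separates them only weakly; in this sketch it is the topic potential that pushes the first passage well ahead of the second.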

Highlights

  • Traditional retrieval functions, including state-of-the-art probabilistic and language models, are typically based on a bag-of-words assumption in which text is represented as unordered sets of terms, so any notion of concept identification, term ordering, or proximity is lost

  • We present a passage retrieval model for capturing semantics through the notion of topic and concept relevance by learning the latent relationships between terms and concepts in relevant passages

  • We present a passage relevance model based on an undirected graphical model (Markov Random Field), and methods for modeling concept, term, and topic relevance as potential functions within the model


Summary

Introduction

Traditional retrieval functions, including state-of-the-art probabilistic and language models, are typically based on a bag-of-words assumption in which text is represented as unordered sets of terms, so any notion of concept identification, term ordering, or proximity is lost. Without modeling contextual dependencies between terms, traditional models are not suitable for disambiguating terms or for identifying relevant text that lacks explicit term matches. These issues matter when retrieving passages from biological literature, where the heavy use of ambiguous terms, acronyms, and term variants makes identification of biological concepts especially challenging. The use of external knowledge sources coupled with query expansion techniques has been a popular method for identifying concept term variants. An acronym like IP could represent immunoprecipitant or ischemic precondition. In this case we can only disambiguate IP if we have sufficient context to understand that one of the topics covered in the document involves ...

