Abstract

The classical bag-of-word models for information retrieval (IR) fail to capture contextual associations between words. In this article, we propose to investigate pure high-order dependence among a number of words forming an unseparable semantic entity, that is, the high-order dependence that cannot be reduced to the random coincidence of lower-order dependencies. We believe that identifying these pure high-order dependence patterns would lead to a better representation of documents and novel retrieval models. Specifically, two formal definitions of pure dependence—unconditional pure dependence (UPD) and conditional pure dependence (CPD)—are defined. The exact decision on UPD and CPD, however, is NP-hard in general. We hence derive and prove the sufficient criteria that entail UPD and CPD, within the well-principled information geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure. We further develop novel methods for extracting word patterns with pure high-order dependence. Our methods are applied to and extensively evaluated on three typical IR tasks: text classification and text retrieval without and with query expansion.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.