Abstract
Query expansion is a classical topic in the field of information retrieval, which is proposed to bridge the gap between searchers' information intents and their queries. Previous researches usually expand queries based on document collections, or some external resources such as WordNet and Wikipedia [1, 2, 3, 4, 5]. However, it seems that independently using one of these resources has some defects, document collections lack semantic information of words, while WordNet and Wikipedia may not include domain-specific knowledge in certain document collection. Our work aims to combine these two kinds of resources to establish an expansion model which represents not only domain-specific information but also semantic information. In our preliminary experiments, we construct a two-layer word graph and use Random-Walk algorithm to calculate the weights of each term in pseudo-relevance feedback documents, then select the highest weighted term to expand original query. The first layer of the word graph contains terms in related documents, while the second layer contains semantic senses corresponding to these terms. These terms and semantic senses are treated as vertices of the graph and connected with each other by all possible relationships, such as mutual information and semantic similarities. We utilized mutual information, semantic similarity and uniform distribution as the weight of term-term relation, sense-sense relation and word-sense relation respectively. Though these experiments show that our expansion outperform original queries, we are troubled with some difficult problems. Given the framework of semantic graph model, we need more effort to find out an optimal graph to represent the relationships between terms and their semantic senses. We utilized a two-layer graph model in our preliminary research, where terms from different documents are treated equally. Maybe we can introduce the document as a third layer in future work, where we can differ the same terms in different documents according to document relevance and context. Then we need appropriately represent initial weights of this words, senses and relationships. Various measures for weights of terms and term relations have been proved effective in other information retrieval tasks, such as TFIDF, mutual information (MI), but there is little research on weights for semantic senses and their relations. For polysemous words, we add all of their semantic senses to the graph and assume that these senses are uniformly distributed. Actually, it is not precise for a word in a special document and query. As we know, a polysemous word may have only one or two senses in a document, and they are not uniformly distributed. Give a word, what we should do is to determine its word senses in a relevant document and estimate the distribution of these senses. Word sense disambiguation may help us in this problem. Then, there are many methods to compute word similarity according to WordNet, which we use to represent the weights of relationships between word senses. Varelas et al implemented some popular methods to compute semantic similarity by mapping terms to an ontology and examining their relationships in that ontology [4]. We also need to know which algorithm for semantic similarity is most suitable for our model. Additional, WordNet is suitable to calculate word similarity but not suitable to measure word relevance. The inner hyperlinks of Wikipedia could help us to calculate word relevance. We wish to find an effective way to combine the similarity measure from WordNet and relevance measure from Wikipedia, which may completely reflect word relationships.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.