Abstract

This paper proposes a new method for query expansion based on bidirectional extraction of phrases, represented as word n-grams, from research paper titles. The proposed method aims to extract information relevant to users’ needs and interests and thus to provide a useful system for technical paper retrieval. The output of the proposed method is a set of trigrams that serve as phrases for query expansion. First, word trigrams are extracted from research paper titles. Second, a co-occurrence graph of the extracted trigrams is constructed. The direction of the edges is considered in two ways, forward and reverse: in the forward co-occurrence graph, each trigram points to the trigrams appearing after it in a paper title, and in the reverse co-occurrence graph, each trigram points to the trigrams appearing before it. Third, the Jaccard similarity between trigrams is computed and used as the edge weight. Fourth, a weighted version of PageRank is applied. Consequently, two types of phrases are obtained as the trigrams with the higher PageRank scores. Trigrams of the first type, obtained from the forward co-occurrence graph, can form a more specific query when users add a technical word or words before them. Trigrams of the second type, obtained from the reverse co-occurrence graph, can form a more specific query when users add a technical word or words after them. The extracted phrases are evaluated as additional features in a paper title classification task using an SVM. The experimental results show that classification accuracy improves over that achieved when only the standard TF-IDF text features are used. Moreover, the trigrams extracted by the proposed method can be used to expand query words in research paper retrieval.
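The four steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`trigrams`, `build_graph`, `weighted_pagerank`), whitespace tokenization, the damping factor of 0.85, and the fixed iteration count are all assumptions made for illustration. In particular, the abstract does not state which sets the Jaccard similarity is computed over; the sketch assumes the sets of titles in which each trigram occurs.

```python
from collections import defaultdict
from itertools import combinations


def trigrams(title):
    """Step 1: word trigrams of a title (assumes lowercase whitespace tokens)."""
    words = title.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def build_graph(titles, reverse=False):
    """Steps 2-3: directed co-occurrence graph with Jaccard edge weights.

    Forward: a trigram points to trigrams appearing after it in a title.
    Reverse: a trigram points to trigrams appearing before it.
    Assumption: Jaccard is taken over the sets of titles containing each trigram.
    """
    occurrences = defaultdict(set)  # trigram -> indices of titles containing it
    edges = set()
    for idx, title in enumerate(titles):
        grams = trigrams(title)
        for g in grams:
            occurrences[g].add(idx)
        for i, j in combinations(range(len(grams)), 2):
            earlier, later = grams[i], grams[j]
            if earlier != later:
                edges.add((later, earlier) if reverse else (earlier, later))
    weights = {}
    for a, b in edges:
        shared = occurrences[a] & occurrences[b]
        union = occurrences[a] | occurrences[b]
        weights[(a, b)] = len(shared) / len(union)
    return list(occurrences), weights


def weighted_pagerank(nodes, weights, damping=0.85, iterations=50):
    """Step 4: PageRank where rank flows along edges in proportion to weight."""
    n = len(nodes)
    if n == 0:
        return {}
    out_sum = defaultdict(float)  # total outgoing weight per node
    for (a, _), w in weights.items():
        out_sum[a] += w
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_sum[v] == 0)
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for (a, b), w in weights.items():
            new_rank[b] += damping * rank[a] * w / out_sum[a]
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical titles, used only to exercise the sketch.
    titles = [
        "a survey of deep learning methods",
        "a survey of graph neural networks",
    ]
    for reverse in (False, True):
        nodes, weights = build_graph(titles, reverse=reverse)
        scores = weighted_pagerank(nodes, weights)
        top = sorted(scores, key=scores.get, reverse=True)[:3]
        print("reverse" if reverse else "forward", [" ".join(t) for t in top])
```

Under these assumptions, the forward graph tends to rank highly the trigrams that close a title, which a user would extend by adding words before them, while the reverse graph favors trigrams that open a title, which a user would extend by adding words after them.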
