Abstract
Given a set of documents and an input query expressed in natural language, the problem of document search is to retrieve all relevant documents, ordered by their degree of relevance. Semantic document search fetches not only documents that contain words from the input query, but also documents that are semantically relevant. For example, the query "friendly pets" will also consider documents that contain the words "dog" and "cat", among others. One way to implement semantic search is to use a probabilistic graph in which the input query is connected to the documents through paths that contain semantically similar words and phrases; we use WordNet to initially populate the graph. Each edge in the graph is labeled with the conditional probability that the destination node is relevant given that the source node is relevant. Our semantic document search algorithm works in two phases. In the first phase, we find all documents in the graph that are close to the input query and create a bounded subgraph that includes the query, the found documents, and the paths that connect them. In the second phase, we simulate multiple random walks. Each random walk starts at the input query and continues until a document is reached, a jump outside the bounded subgraph is made, or the number of allowed jumps is exhausted. This allows us to rank the documents by the number of random walks that terminate in them. We experimentally validated the algorithm on the Cranfield benchmark, which contains 1400 documents and 225 natural language queries. We show that the algorithm achieves a higher mean average precision (MAP) than both a keyword-based search algorithm and a previously published algorithm that relies on a variation of the probabilistic graph.
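The two phases described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy graph, the node names, the `doc:` prefix, and all function names are hypothetical. The sketch also makes two simplifying assumptions that the abstract does not confirm: "close to the query" is taken to mean reachable within a fixed number of edges, and each jump picks an outgoing edge with probability proportional to its label.

```python
import random
from collections import Counter, deque

# Hypothetical toy graph: node -> list of (neighbor, conditional probability).
GRAPH = {
    "q:friendly pets": [("pet", 0.9), ("friendly", 0.6)],
    "pet": [("dog", 0.8), ("cat", 0.7)],
    "friendly": [("dog", 0.3)],
    "dog": [("doc:1", 0.9), ("doc:2", 0.2)],
    "cat": [("doc:2", 0.8)],
}


def is_document(node):
    # Assumption: document nodes are marked with a "doc:" prefix.
    return node.startswith("doc:")


def bounded_subgraph(graph, query, max_depth):
    """Phase 1 (simplified): nodes within max_depth edges of the query."""
    seen = {query}
    frontier = deque([(query, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth or is_document(node):
            continue  # do not expand past the bound or past a document
        for neighbor, _ in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


def random_walk(graph, query, subgraph, max_jumps, rng):
    """Phase 2: one walk; returns the document reached, or None."""
    node = query
    for _ in range(max_jumps):
        edges = graph.get(node, [])
        if not edges:
            return None  # dead end
        neighbors = [n for n, _ in edges]
        weights = [p for _, p in edges]
        # Assumption: jump probability is proportional to the edge label.
        node = rng.choices(neighbors, weights=weights, k=1)[0]
        if node not in subgraph:
            return None  # jumped outside the bounded subgraph
        if is_document(node):
            return node
    return None  # allowed jumps exhausted


def rank_documents(graph, query, max_depth=3, max_jumps=5, walks=10000, seed=0):
    """Rank documents by how many random walks terminate in them."""
    rng = random.Random(seed)
    subgraph = bounded_subgraph(graph, query, max_depth)
    hits = Counter()
    for _ in range(walks):
        doc = random_walk(graph, query, subgraph, max_jumps, rng)
        if doc is not None:
            hits[doc] += 1
    return [doc for doc, _ in hits.most_common()]


ranking = rank_documents(GRAPH, "q:friendly pets")
```

In this toy graph, "dog" is reachable through both "pet" and "friendly", so most walks pass through it, and the document it links to most strongly collects the most terminations. Shrinking `max_depth` shrinks the bounded subgraph, which causes walks to leave it before reaching any document.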