Abstract

A huge collection of documents is available within a few mouse clicks. The current World Wide Web is a web of pages. Users have to guess keywords that might lead, through search engines, to the pages containing the information of interest, and then browse hundreds or even thousands of returned pages to find what they want. In our work we build a generalized suffix tree for our documents and propose a search technique for retrieving documents based on a kind of phrase called word sequences. Our proposed method efficiently searches for a given phrase (with missing or additional words in between) with better performance.

Keywords: Document retrieval; Frequent word sequences; Suffix tree; Traversal technique.

I. INTRODUCTION

With the growth of the web, hundreds of millions of people engage in information retrieval (IR) every day when they use a web search engine or search their email. IR is fast becoming the dominant form of information access, overtaking traditional database-style searching. The IR process begins when a user enters a query, such as a search string or phrase in a web search engine, to identify the related documents or URLs.

Almost all documents now have electronic copies. With the development of the WWW, retrieving documents through web search engines based on a query is an efficient technique, but it should not be time consuming. For this reason, the precision with which related documents are retrieved for a given query is vital for a search engine. Cluster-based information retrieval techniques also exist [11].

The next section deals with information retrieval and related work on text documents. Section 3 describes the suffix tree. Section 4 deals with building the generalized suffix tree. Section 5 explains the traversal technique used for quick retrieval of documents. Section 6 shows the experiment

Highlights

  • With the growth of the web, hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email

  • The goal of this step is to reduce the dimension of the database by eliminating those words that are not frequent enough to be in a frequent k-word sequence, for k >= 2

  • After building the suffix tree as mentioned above, we traverse the tree for a given word sequence "eat chocolates", which should retrieve all the documents that contain "children eat chocolates", "children eat dry fruits and chocolates", or "children of four years eat many chocolates"
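
The two steps highlighted above (dropping infrequent words, then traversing the tree for a word sequence with extra words in between) can be illustrated with a minimal Python sketch. It is not the paper's implementation. It makes several assumptions: a word counts as "frequent" when it occurs in at least `min_support` documents, an uncompressed word-level suffix trie stands in for the generalized suffix tree, and any number of extra words may appear between query words. The names `frequent_words`, `build_suffix_trie`, and `search_with_gaps` are hypothetical.

```python
from collections import Counter


def frequent_words(docs, min_support):
    """Words occurring in at least `min_support` documents (assumed criterion)."""
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))  # count each word once per document
    return {w for w, n in doc_freq.items() if n >= min_support}


def new_node():
    return {"children": {}, "docs": set()}


def build_suffix_trie(docs, keep):
    """Insert every word-level suffix of every document into one shared trie.

    Uncompressed stand-in for a generalized suffix tree: each node records
    the ids of the documents whose suffixes pass through it.
    """
    root = new_node()
    for doc_id, words in docs.items():
        kept = [w for w in words if w in keep]  # drop infrequent words
        for start in range(len(kept)):
            node = root
            for w in kept[start:]:
                node = node["children"].setdefault(w, new_node())
                node["docs"].add(doc_id)
    return root


def search_with_gaps(root, query):
    """Ids of documents containing the query words in order, gaps allowed."""
    if not query:
        return set()
    start = root["children"].get(query[0])
    if start is None:
        return set()

    def walk(node, i):
        if i == len(query):  # all query words matched along this path
            return set(node["docs"])
        found = set()
        for w, child in node["children"].items():
            # A matching word advances the query; any other word is a gap.
            found |= walk(child, i + 1 if w == query[i] else i)
        return found

    return walk(start, 1)


docs = {
    1: "children eat chocolates".split(),
    2: "children eat dry fruits and chocolates".split(),
    3: "children of four years eat many chocolates".split(),
    4: "chocolates eat children".split(),
}
keep = frequent_words(docs, min_support=1)  # keep every word in this toy run
trie = build_suffix_trie(docs, keep)
print(sorted(search_with_gaps(trie, ["eat", "chocolates"])))  # [1, 2, 3]
```

Document 4 contains both query words but in the wrong order, so it is not returned. A real suffix tree would compress single-child chains into labeled edges. The trie here trades space for brevity, and the traversal may visit many branches when the gap is unbounded.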


Summary

INTRODUCTION

With the growth of the web, hundreds of millions of people engage in information retrieval (IR) every day when they use a web search engine or search their email. The IR process begins when a user enters a query, such as a search string or phrase in a web search engine. With the development of the WWW, retrieving documents through web search engines based on a query is an efficient technique. For this reason, the precision with which related documents are retrieved for a given query is vital for a search engine. Cluster-based information retrieval techniques also exist [11]. The next section deals with information retrieval and related work on text documents.

RELATED WORK
SUFFIX TREE
Definition
CONSTRUCTION OF SUFFIX TREES FOR DOCUMENTS
Finding frequent 2-word sets
EXPERIMENTAL SETUP
Cleaning of documents and generating suffixes
Generating Suffixes and Building GST