In this paper we describe an information retrieval system in which advanced natural language processing techniques are used to enhance the effectiveness of term-based document retrieval. The backbone of our system is a traditional statistical engine that builds inverted index files from pre-processed documents, and then searches and ranks the documents in response to user queries. Natural language processing is used to (a) preprocess the documents in order to extract content-carrying terms, (b) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (c) process the user's natural language requests into effective search queries. During the course of the Text REtrieval Conferences TREC-1 and TREC-2 (see Harman (1993) for a detailed description of TREC), our system has evolved from a scaled-up prototype, originally tested on such collections as CACM-3204 and Cranfield, to its present form, which can be effectively used to process hundreds of millions of words of unrestricted text.
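To make the "statistical backbone" concrete, the following is a minimal sketch of an inverted index with ranked retrieval. It is not the system described above: the tf-idf weighting, the lowercase-and-split tokenization (a stand-in for the natural language preprocessing stage), and the names `build_index` and `search` are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index mapping term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumed stand-in for NLP preprocessing: lowercase and split on whitespace.
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Return doc ids ranked by the summed tf-idf weight of matching query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        # Rarer terms get a higher inverse document frequency.
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "d1": "parser extracts phrases from text",
    "d2": "inverted index ranks documents",
    "d3": "index of text documents and phrases",
}
index = build_index(docs)
ranking = search(index, len(docs), "inverted index")
```

In a pipeline of the kind the abstract outlines, the whitespace tokenizer would presumably be replaced by the term-extraction stage, so that the index holds content-carrying terms rather than raw word tokens.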