Automatic structuring and retrieval of large text files

Gerard Salton,Chris Buckley,James Allan

doi:10.1145/175235.175243

Abstract

In many operational environments, large text files must be processed covering a wide variety of different topic areas. Aids must then be provided to the user that permit collection browsing and make it possible to locate particular items on demand. The conventional text analysis methods based on preconstructed knowledge-bases and other vocabulary-control tools are difficult to apply when the subject coverage is unrestricted. An alternative approach, applicable to text collections in any subject area, is introduced which uses the document collections themselves as a basis for the text analysis, together with sophisticated text matching operations carried out at several levels of detail. Methods are described for relating semantically similar pieces of text, and for using the resulting hypertext structures for collection browsing and information retrieval.

Full Text