INTELLIGENT NLP-DRIVEN TEXT CLASSIFICATION

Roberto Basili, Alessandro Moschitti

doi:10.1142/s0218213002000952

Abstract

Information Retrieval (IR) and NLP-driven Information Extraction (IE) are complementary activities. IR helps in locating specific documents within a huge search space (localization), while IE supports the localization of specific information within a document (extraction or explanation). In application scenarios both capabilities are usually needed. IE is important here because it can enrich IR inferences with the information that motivates them. Work on Web-based IR suggests that embedding linguistic information (e.g. sense distinctions) at a suitable level within traditional quantitative approaches (e.g. query expansion as in [26]) is a promising approach. "Which linguistic level is best suited to which IR mechanism" is the interesting representational problem posed by the current stage of research, and it is the central concern of this paper. A traditional method for efficient text categorization (TC) is presented here. The original features of the proposed model are a self-adapting parameterized weighting model and the use of linguistic information. The key idea is the integration of NLP methods within a robust and efficient TC framework. This makes it possible to combine the benefits of large-scale, efficient IR with the richer expressivity that is closer to IE. In this paper we capitalize on the systematic benchmarking resources available in TC to derive extensive empirical evidence about the above representational problem. The positive experimental results confirm that the proposed TC framework is a viable approach to intelligent text categorization on a large scale.

Full Text