Abstract

Information Retrieval (IR) and NLP-driven Information Extraction (IE) are complementary activities. IR helps locate specific documents within a huge search space (localization), while IE supports locating specific information within a document (extraction or explanation). Application scenarios usually need both capabilities. IE is important here because it can enrich IR inferences with motivating information. Work on Web-based IR suggests that embedding linguistic information (e.g. sense distinctions) at a suitable level within traditional quantitative approaches (e.g. query expansion as in [26]) is a promising approach. "Which linguistic level is best suited to which IR mechanism" is the interesting representational problem posed by the current stage of research, and it is the central concern of this paper. A traditional method for efficient text categorization (TC) is presented here. The original features of the proposed model are a self-adapting parameterized weighting model and the use of linguistic information. The key idea is the integration of NLP methods within a robust and efficient TC framework. This makes it possible to combine the benefits of large-scale, efficient IR with the richer expressivity closer to IE. In this paper we capitalize on the systematic benchmarking resources available in TC to derive extensive empirical evidence about the above representational problem. The positive experimental results confirm that the proposed TC framework is a viable approach to intelligent text categorization on a large scale.
