Abstract
We present a simple approach for the classification of “noisy” documents using bigrams and named entities. The approach combines conventional feature selection with a contextual approach to filter out passages around selected features. Originally designed for call-for-tender documents, the method can be useful for other web collections that also contain non-topical content. Experiments are conducted on our in-house collection as well as on the 4-Universities data set, Reuters-21578 and 20 Newsgroups. We find a significant improvement on our collection and the 4-Universities data set (10.9% and 4.1%, respectively). Although the best results are obtained by combining bigrams and named entities, the impact of the latter is not found to be significant.