Abstract
This article studies aggressive word removal in text categorization as a way to reduce the noise in free texts and to improve the computational efficiency of categorization. We use a novel stop word identification method to automatically generate domain-specific stoplists that are much larger than a conventional domain-independent stoplist. In our tests with three categorization methods on text collections from different domains and applications, significant numbers of words were removed without sacrificing categorization effectiveness. In the test of the Expert Network method on CACM documents, for example, an 87% removal of unique words reduced the vocabulary of documents from 8,002 distinct words to 1,045 words, which resulted in 63% time savings and 74% memory savings in the computation of category ranking, with a 10% precision improvement on average over not using word removal. This study shows that automated word removal based on corpus statistics has a practical and significant impact on the computational tractability of categorization methods in large databases.
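To make the idea of a corpus-derived, domain-specific stoplist concrete, here is a minimal sketch in Python. The abstract does not describe the article's stop word identification method, so this sketch substitutes a simple document-frequency criterion as a stand-in; it is not the authors' algorithm. The function names, the thresholds `max_doc_ratio` and `min_doc_count`, and the toy documents are all assumptions made for illustration. The last helper only checks the vocabulary-reduction arithmetic quoted above.

```python
from collections import Counter


def build_stoplist(docs, max_doc_ratio=0.5, min_doc_count=2):
    """Build a stoplist from corpus statistics (illustrative stand-in only).

    A word is treated as a stop word if it occurs in more than
    `max_doc_ratio` of the documents or in fewer than `min_doc_count`
    documents. Both thresholds are hypothetical, not taken from the article.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each word at most once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    return {
        word
        for word, df in doc_freq.items()
        if df / n_docs > max_doc_ratio or df < min_doc_count
    }


def remove_stop_words(doc, stoplist):
    """Drop stoplisted words from a whitespace-tokenized document."""
    return [w for w in doc.lower().split() if w not in stoplist]


def vocabulary_reduction(before, after):
    """Fraction of distinct words removed from the vocabulary."""
    return (before - after) / before


# Toy corpus, invented for this example.
docs = [
    "the parser reads the input file",
    "the compiler reads the source file",
    "the network sends the packet",
    "the network drops the packet quickly",
]

stoplist = build_stoplist(docs)
filtered = [remove_stop_words(d, stoplist) for d in docs]

# Arithmetic check of the figures in the abstract: 8,002 -> 1,045 words.
reduction = vocabulary_reduction(8002, 1045)
print(f"vocabulary reduction: {reduction:.1%}")
```

On the abstract's own figures, (8,002 - 1,045) / 8,002 is roughly 0.869, which matches the reported 87% removal of unique words.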
More From: Journal of the American Society for Information Science