Abstract

To handle the complexity inherent in human textual communication, Automatic Document Classification (ADC) methods often adopt several simplifications. One such simplification is to treat the terms that compose a document as independent, which may hide important relationships between them. These relationships can encapsulate non-trivial patterns that improve classification effectiveness. In this work, we propose NetClass, a new network-based model for documents that explicitly considers term relationships, and we introduce a family of relational algorithms for ADC. Among them is the LRN-WRN classifier, a lazy relational ADC algorithm that exploits not only relationships between terms but also neighborhood information. As our extensive experimental evaluation shows, LRN-WRN achieves competitive performance against state-of-the-art ADC methods, including SVM, across seven distinct domains. More specifically, LRN-WRN outperforms state-of-the-art classifiers in 5 out of 7 domains and ranks among the top two classifiers in all assessed domains. Our evaluation highlights the high effectiveness of our proposal, as well as its efficiency in terms of runtime. Besides effectiveness and efficiency, the simplicity of our proposal and the absence of complex parameter tuning are key characteristics that make our algorithms interesting alternatives for ADC. In particular, our experimental evaluation shows LRN-WRN to be a promising alternative for dynamic domains with a huge volume of short texts (e.g., social media content) or with several classes.
