Abstract

Text classification is the task of automatically assigning text documents to one or more classes according to their content. Recently, character-level models based on deep neural networks have been developed for text classification. In some cases, character-level models have outperformed word-level models and traditional models, especially on user-generated datasets. The architectures used for character-level models are convolutional neural networks (CNN) and bidirectional recurrent neural networks (Bi-RNN), the latter with its variants long short-term memory (LSTM) and gated recurrent units (GRU). In this paper, CNN, Bi-RNN, and a combination of both are tested with character-level and word-level features for text classification on English and Indonesian social media datasets. On small datasets, word-level models outperformed character-level models. However, on datasets with millions of examples, character-level models outperformed word-level models. Further analysis of the evaluation of word-level and character-level models is also discussed in this paper.
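
To make the distinction between word-level and character-level features concrete, the following is a minimal sketch of how the same text can be turned into integer sequences at either granularity before being fed to a CNN or Bi-RNN. It is an illustration only, not the preprocessing used in this paper: the helper names (`build_vocab`, `encode`, `word_tokens`, `char_tokens`), the whitespace tokenizer, the lowercasing, and the `<pad>`/`<unk>` conventions are all assumptions.

```python
from collections import Counter

def build_vocab(tokens, min_count=1):
    """Map each token seen at least min_count times to an integer id."""
    counts = Counter(tokens)
    vocab = {"<pad>": 0, "<unk>": 1}  # assumed reserved ids
    for tok, c in sorted(counts.items()):
        if c >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab, max_len):
    """Truncate or pad a token list to max_len integer ids."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

def word_tokens(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def char_tokens(text):
    # Every character, including spaces, is a token.
    return list(text.lower())

# Hypothetical toy training texts.
train = ["good morning", "good night"]
word_vocab = build_vocab([t for s in train for t in word_tokens(s)])
char_vocab = build_vocab([t for s in train for t in char_tokens(s)])

# A noisy, user-generated style spelling that never occurs in training.
noisy = "goood morning"
word_ids = encode(word_tokens(noisy), word_vocab, max_len=4)
char_ids = encode(char_tokens(noisy), char_vocab, max_len=16)

print(word_ids)  # [1, 3, 0, 0]: "goood" falls back to <unk>
print(char_ids)  # no <unk> ids: every character was seen in training
```

In this toy case the misspelled word is lost to the word-level vocabulary, while the character-level encoding still represents it fully. This is one commonly cited motivation for character-level models on noisy text; it is offered here as background, not as a finding of this paper.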
