Abstract

Text Classification is an important research area in natural language processing (NLP) that has received a considerable amount of scholarly attention in recent years. However, real Chinese online news is characterized by long text, a large amount of information and complex structure, which also reduces the accuracy of Chinese long text classification as a result. To improve the accuracy of long text classification of Chinese news, we propose a BERT-based local feature convolutional network (LFCN) model including four novel modules. First, to address the limitation of Bidirectional Encoder Representations from Transformers (BERT) on the length of the max input sequence, we propose a named Dynamic LEAD-n (DLn) method to extract short texts within the long text based on the traditional LEAD digest algorithm. In Text-Text Encoder (TTE) module, we use BERT pretrained language model to complete the sentence-level feature vector representation of a news text and to capture global features by using the attention mechanism to identify correlated words in text. After that, we propose a CNN-based local feature convolution (LFC) module to capture local features in text, such as key phrases. Finally, the feature vectors generated by the different operations over several different periods are fused and used to predict the category of a news text. Experimental results show that the new method further improves the accuracy of long text classification of Chinese news.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.