Abstract

Traditional manual text classification methods can no longer cope with today's huge volumes of data, and advances in deep learning have accelerated text classification technology. Against this background, we present different word embedding methods such as word2vec, doc2vec, TF-IDF and an embedding layer. After word embedding, we demonstrate 8 deep learning models that classify news text automatically and compare the accuracy of all the models; the '2 layer GRU model with pretrained word2vec embeddings' achieved the highest accuracy. Automatic text classification can help people summarize text accurately and quickly from the mass of text information. Whether in academia or in industry, it is a topic worth discussing.

Highlights

  • In recent years, with the rapid development of Internet technology and information technology, and especially with the arrival of the era of big data, a huge amount of data is flooding every field of our life

  • Text classification is the process of assigning labels to text according to its content; it is one of the fundamental tasks in natural language processing (NLP)

  • NLP methods convert human language into numerical vectors for machines to compute on; with these word embeddings, researchers can perform tasks such as sentiment analysis, machine translation and natural language inference

Summary

Introduction

With the rapid development of Internet technology and information technology, especially with the arrival of the era of big data, a huge amount of data is flooding every field of our life. NLP methods convert human language into numerical vectors for machines to compute on; with these word embeddings, researchers can perform tasks such as sentiment analysis, machine translation and natural language inference. Each sample is represented by one vector, and the dimension of that vector can be set by the user. Both word2vec and doc2vec are unsupervised learning methods, and doc2vec was developed on the basis of word2vec. The doc2vec method trains a model on the corpus and uses that model to map every sample to a fixed-dimension vector. The same corpus is used for the word2vec method; after the word2vec model is generated, every word is mapped to a 100-dimension vector. In the embedding layer of a deep learning model, the words are trained and transferred to the layer; the dimension of each word is also set to 100 here, the same as in the word2vec method. The dimension of each sample after these 4 methods differs greatly, as Table 1 shows.
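The contrast in sample dimensions described above can be sketched in plain Python. This is a minimal illustration, not the paper's pipeline: the toy corpus, the stand-in "word vectors", and all function names are assumptions for demonstration. A TF-IDF vector has one dimension per vocabulary word, while averaging (hypothetical) 100-dimension word vectors yields a fixed 100-dimension vector per sample regardless of its length, which is the word2vec-style representation the paragraph describes.

```python
import math
from collections import Counter

# Toy corpus; the paper would use the news dataset instead.
corpus = [
    "stocks rally as markets open",
    "team wins the championship game",
    "markets fall on rate fears",
]
tokenized = [doc.split() for doc in corpus]
vocab = sorted({w for doc in tokenized for w in doc})

def tfidf_vector(tokens, docs, vocab):
    """TF-IDF: one dimension per vocabulary word, so the vector
    length equals the vocabulary size."""
    n = len(docs)
    counts = Counter(tokens)
    vec = []
    for w in vocab:
        tf = counts[w] / len(tokens)
        df = sum(1 for d in docs if w in d)
        idf = math.log(n / df) if df else 0.0
        vec.append(tf * idf)
    return vec

# Stand-in for trained word2vec vectors: every word gets an arbitrary
# 100-dimension vector; a real run would learn these with gensim.
DIM = 100
word_vecs = {w: [(hash(w) % 7) / 7.0] * DIM for w in vocab}

def doc_vector(tokens, word_vecs, dim=DIM):
    """Average the word vectors: every sample maps to the same fixed
    dimension (100 here) no matter how many words it contains."""
    vec = [0.0] * dim
    for w in tokens:
        for i, x in enumerate(word_vecs.get(w, [0.0] * dim)):
            vec[i] += x
    return [x / max(len(tokens), 1) for x in vec]

print(len(tfidf_vector(tokenized[0], tokenized, vocab)))  # vocabulary size
print(len(doc_vector(tokenized[0], word_vecs)))           # always 100
```

On a real vocabulary of tens of thousands of words the TF-IDF vector would be correspondingly long and sparse, which is why Table 1 shows such different dimensions across the 4 methods.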

Deep learning classification models
Results
Conclusion