Abstract
Traditional manual text classification cannot cope with today's enormous data volumes, and advances in deep learning have accelerated automatic text classification techniques. Against this background, we present several word embedding methods: word2vec, doc2vec, TF-IDF, and a trainable embedding layer. After word embedding, we evaluate eight deep learning models for automatic news text classification and compare their accuracy; the two-layer GRU model with pretrained word2vec embeddings achieved the highest accuracy. Automatic text classification can help people summarize text accurately and quickly from massive amounts of text information, and it remains a topic worth discussing in both academia and industry.
Highlights
In recent years, with the rapid development of Internet and information technology, and especially with the arrival of the era of big data, a huge amount of data is flooding every field of our lives
Text classification is the process of assigning labels to text according to its content; it is one of the fundamental tasks in natural language processing (NLP)
NLP methods convert human language into numerical vectors that machines can compute with; with these word embeddings, researchers can perform tasks such as sentiment analysis, machine translation, and natural language inference
Summary
With the rapid development of Internet and information technology, and especially with the arrival of the era of big data, a huge amount of data is flooding every field of our lives. NLP methods convert human language into numerical vectors that machines can compute with; with these word embeddings, researchers can perform tasks such as sentiment analysis, machine translation, and natural language inference. Each sample is represented as one vector, and the dimension of the vector can be chosen by the user. Both word2vec and doc2vec are unsupervised learning methods, and doc2vec was developed on the basis of word2vec. The doc2vec method trains a model on the corpus and uses that model to map every sample to a fixed-dimension vector. The same corpus is used for the word2vec method; after the word2vec model is generated, every word is mapped to a 100-dimension vector. In the embedding layer of a deep learning model, the words are trained and transformed within the layer; the dimension of each word is also set to 100 here, the same as in the word2vec method. The dimension of each sample differs considerably across these four methods, as Table 1 shows
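As an illustration of how one of these embedding methods maps every sample to a fixed-dimension vector, the sketch below implements a minimal TF-IDF vectorizer in plain Python. This is not the paper's code; the function name, tokenization, and IDF smoothing are our own assumptions, chosen to show the principle: every document becomes a vector whose dimension equals the vocabulary size.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document to a fixed-dimension TF-IDF vector.

    Illustrative sketch only (hypothetical helper, not the paper's code):
    one vector entry per vocabulary term, so all samples share one dimension.
    """
    # Tokenize naively and build a fixed vocabulary over the whole corpus.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {w: math.log(n_docs / df[w]) + 1.0 for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[w] / len(toks) * idf[w] for w in vocab])
    return vocab, vectors

docs = ["stock market rises", "market news today", "sports news today"]
vocab, vecs = tfidf_vectors(docs)
# Every sample becomes a vector of the same dimension: the vocabulary size.
assert all(len(v) == len(vocab) for v in vecs)
```

Note the contrast with word2vec or an embedding layer, where the dimension is a user-chosen hyperparameter (100 in this paper) rather than the vocabulary size; this is why the per-sample dimension differs so much across the four methods.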