Abstract

As the Internet develops rapidly, the volume of text is growing just as quickly. Emails, online novels and other literary content, news reports, personal blogs, Weibo posts, and comments all continuously add to the amount of text. However, most of this data is neither classified nor processed, which results in large amounts of spam, junk information, meaningless articles, and advertisements. Such content not only consumes substantial Internet resources but also degrades users' online experience and reduces their work and study efficiency. Therefore, it is vital to accurately classify large amounts of text, judge its nature according to the classification result, and handle it accordingly. This paper reviews the classification of massive texts based on the Spark framework.

Highlights

  • With the development of Internet technology and social media, massive amounts of network text data have been generated

  • Many researchers have combined big data frameworks with traditional machine learning to solve the problem that traditional text classification cannot handle the classification of massive texts [11]

  • Deep learning algorithms have achieved remarkable results in image recognition and speech recognition [22] [23]. In natural language processing and semantic mining, they are applied mainly through word vector algorithms, convolutional neural networks (CNN) [24] [25], and recurrent neural networks (RNN)

Summary

Introduction

With the development of Internet technology and social media, massive amounts of network text data have been generated. Efficiently classifying massive text data has important theoretical significance and application value [1], and efficiently extracting valuable information from it has become a research hot spot [2]. The MapReduce framework is the most widely used parallel computing framework for big data, and research on parallel text classification algorithms under MapReduce has attracted growing attention. The disadvantage of MapReduce is that it stores intermediate results on HDFS during parallel computing, which incurs a large amount of I/O overhead. Spark, by contrast, is a parallel framework based on in-memory computing: it does not write intermediate results directly to disk during execution (part of the data is cached to disk only when memory is insufficient), so its execution efficiency is relatively good [12].
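The contrast between disk-backed and in-memory intermediate results can be illustrated with a toy two-stage text pipeline. This is only a sketch in plain Python, not Spark or MapReduce code: the function names (`tokenize`, `count_terms`, `run_disk_backed`, `run_in_memory`) are hypothetical, and a local temporary file stands in for HDFS.

```python
import json
import os
import tempfile
from collections import Counter


def tokenize(docs):
    """Stage 1: split each document into lowercase tokens."""
    return [doc.lower().split() for doc in docs]


def count_terms(token_lists):
    """Stage 2: count term frequencies across all documents."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts


def run_disk_backed(docs):
    """MapReduce-like flow: the stage 1 output is written to disk and read back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage1.json")  # stand-in for an HDFS file
        with open(path, "w", encoding="utf-8") as f:
            json.dump(tokenize(docs), f)   # extra write between stages
        with open(path, encoding="utf-8") as f:
            intermediate = json.load(f)    # extra read between stages
    return count_terms(intermediate)


def run_in_memory(docs):
    """Spark-like flow: the stage 1 output stays in memory for stage 2."""
    intermediate = tokenize(docs)
    return count_terms(intermediate)


docs = ["Spark caches data in memory", "MapReduce writes data to disk"]
# Both flows compute the same term counts; only the handling of the
# intermediate result differs.
same_result = run_disk_backed(docs) == run_in_memory(docs)
```

Both functions return the same counts, but the disk-backed version performs one extra write and one extra read per stage boundary. In an iterative algorithm with many stages, that cost is paid repeatedly, which is the I/O overhead the paragraph above attributes to MapReduce.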

Current situation of text classification
Preprocessing of text
Literature and Art
Text vectorization
Text classification algorithm
Findings
Conclusion
