Review of the Classification of Massive Chinese Texts Based on Spark

Liu Yu, Yansong Wang

doi:10.1051/matecconf/201823201039

Abstract

As the Internet develops rapidly, the volume of text is growing just as fast. Emails, online novels and other literary content, news reports, personal blogs, Weibo posts, and comments all add to the amount of text continuously. However, most of this data is not classified or processed, which leads to a great deal of spam, junk information, meaningless articles, and advertisements. Such content not only consumes substantial Internet resources but also degrades users' online experience and reduces their work and study efficiency. Therefore, it is vital to accurately classify large amounts of text, judge its nature according to the classification result, and handle it in a targeted way. This paper reviews the classification of massive texts based on the Spark framework.

Highlights

With the development of Internet technology and social media, massive amounts of network text data have been generated
Many researchers have combined big data frameworks with traditional machine learning to solve the problem that traditional text classification cannot handle massive texts [11]
Deep learning algorithms have achieved remarkable results in image recognition and speech recognition [22] [23]. In natural language processing and semantic mining, deep learning is applied mainly through algorithms such as word vectors, convolutional neural networks (CNN) [24] [25], and recurrent neural networks (RNN)

Summary

Introduction

With the development of Internet technology and social media, massive amounts of network text data have been generated. Efficiently classifying massive text data has important theoretical significance and application value [1], and efficiently extracting valuable information from massive text has become a research hot spot [2]. The MapReduce framework is the most widely used big data parallel computing framework, and research on parallel text classification algorithms under MapReduce has attracted growing attention. The disadvantage of the MapReduce framework is that it stores intermediate results on HDFS during parallel computing, which leads to a large amount of I/O overhead. The Spark framework, by contrast, is a parallel framework based on in-memory computing: it does not write intermediate results directly to disk during execution (data is spilled to disk only when memory is insufficient), so Spark executes relatively efficiently [12].
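The I/O difference described above can be illustrated with a toy simulation in plain Python. This is only a sketch and uses neither Hadoop nor Spark: the function names (`count_words_via_disk`, `count_words_in_memory`), the two sample documents, and the JSON part files are all assumptions made for illustration. The first function writes every intermediate result to a file and reads it back, loosely mimicking intermediate results stored on HDFS; the second keeps the intermediate results in memory.

```python
import json
import os
import tempfile
from collections import Counter


def tokenize(doc):
    # Hypothetical whitespace tokenizer; real Chinese text would need a word segmenter.
    return doc.lower().split()


def count_words_via_disk(docs, workdir):
    """Disk-backed style: each intermediate result is written to a file, then read back."""
    paths = []
    for i, doc in enumerate(docs):
        path = os.path.join(workdir, f"part-{i}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(Counter(tokenize(doc)), f)  # "map" output goes to disk
        paths.append(path)

    total = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            total.update(json.load(f))  # "reduce" step reads it back
    io_ops = 2 * len(paths)  # one write plus one read per partition
    return total, io_ops


def count_words_in_memory(docs):
    """In-memory style: intermediate results never leave memory."""
    partials = [Counter(tokenize(doc)) for doc in docs]
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total, 0  # no file I/O for intermediate results


docs = ["spark keeps data in memory", "mapreduce writes data to disk"]
with tempfile.TemporaryDirectory() as workdir:
    disk_counts, disk_io = count_words_via_disk(docs, workdir)
mem_counts, mem_io = count_words_in_memory(docs)
```

Both paths produce the same word counts; only the number of file operations on intermediate data differs. In a real cluster the disk path also pays for replication and network transfer, which this sketch does not model.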

Current situation of text classification

Preprocessing of text

Literature and Art

Text vectorization

Text classification algorithm

Findings

Conclusion


Journal: MATEC Web of Conferences
Publication Date: Jan 1, 2018
Citations: 1
License type: CC BY 4.0

