Abstract

Today we are surrounded by data, and especially by Chinese-language information. The exponential growth of data first presented challenges to leading technology companies such as Alibaba, Jingdong, Amazon, and Microsoft, which must sift through terabytes and petabytes of data to determine which websites are popular, which books are in demand, and which kinds of advertisements appeal to users. Chinese word segmentation is a fundamental problem in Chinese information processing, and the segmentation algorithm is one of its core components; because Chinese, unlike English, does not mark word boundaries with delimiters, words must be identified before any further processing can take place. Chinese lexical analysis is therefore the foundation of, and the key to, Chinese information processing. IKAnalyzer (IK) and ICTCLAS (IC) are two widely used Chinese word segmentation algorithms, and both play an important role in segmenting text data; applying them in a Hadoop distributed environment promises better performance. In this paper we compare the performance of the IK and IC algorithms both theoretically and experimentally. We report experimental work on the problem of segmenting massive volumes of Chinese text and its solution on a Hadoop cluster, using the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming framework for parallel processing of large data sets. We implemented a prototype Hadoop cluster, HDFS storage, and a MapReduce pipeline for processing large text data sets under representative big-data application scenarios. The results obtained from our experiments are favorable for both the IC and IK algorithms on the massive Chinese text segmentation problem. Furthermore, we evaluate both segmenters in terms of performance: although loading data into the parallel distributed system and tuning its execution took much longer than on a centralized system, the observed performance of the word segmentation algorithms was strikingly better.
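To make the setup concrete, the sketch below shows how a segmenter such as IK can be invoked inside a MapReduce job: each mapper segments one line of Chinese text and emits (token, 1) pairs for a summing reducer, the classic word-count pattern. This is a minimal illustration assuming IKAnalyzer's public IKSegmenter API and Hadoop's standard Mapper class; the class name IkSegmentMapper and the word-count framing are illustrative assumptions, not the paper's published implementation.

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.wltea.analyzer.core.IKSegmenter;
    import org.wltea.analyzer.core.Lexeme;

    // Illustrative mapper (hypothetical name): segments each input line of
    // Chinese text with IKAnalyzer and emits (token, 1) pairs so that a
    // summing reducer can count token frequencies across the whole corpus.
    public class IkSegmentMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // 'true' selects IK's smart (coarse-grained) segmentation mode.
            IKSegmenter segmenter =
                    new IKSegmenter(new StringReader(value.toString()), true);
            Lexeme lexeme;
            while ((lexeme = segmenter.next()) != null) {
                word.set(lexeme.getLexemeText());
                context.write(word, ONE);
            }
        }
    }

Because HDFS splits the input across the cluster and MapReduce runs one mapper per split, segmentation of independent lines parallelizes naturally; an ICTCLAS-based mapper could presumably be structured the same way by wrapping its native tokenizer in place of IKSegmenter.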
