Word segmentation algorithm based on finding the shortest path in a graph

D V Lande, B A Berezin, O Yu Pavlenko

doi:10.35681/1560-9189.2017.19.4.142917

Abstract

The paper considers word segmentation algorithms for texts written without explicit word delimiters. Two main classes of models exist: statistical and dictionary-based. Among dictionary-based models, one option is the maximal matching algorithm, which has two variants depending on the direction of text processing: Forward Maximal Matching (FMM) and Backward Maximal Matching (BMM). The second option is an algorithm that finds the segmentation with the minimum number of words. A new word segmentation algorithm based on a modified wave algorithm is presented. It takes the features of the input data into account and is constructed so that the necessary calculations are performed in a single pass, which reduces its computational complexity. The algorithm is described, and an example shows how an input string in English is split into words by representing it as a graph and finding the shortest path. To assess segmentation quality, the EDWS (Edit Distance of the Word Separator) method is presented. A special tool was used to assess Chinese word segmentation on a test corpus of news texts. Segmentation quality estimates are obtained for the proposed shortest-path algorithm and for a number of other known segmenters. An example of segmentation of a news text in Russian is given. The possibilities of using the developed algorithm in information retrieval tasks for national Internet resources are shown. The implementation of the algorithm is used in building a generalized domain model based on monitoring the resources of the Chinese Internet segment. The growing number of information resources in the Chinese Internet segment makes it necessary to create global information retrieval systems, whose search indexes require fast, accurate, and complete segmentation of words from texts. The segmentation quality estimates obtained with the proposed algorithm when building a search system index indicate that it can be used for information resources of the Chinese Internet segment.
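The idea of minimum-word segmentation as a shortest path can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that graph nodes are character positions, that an edge joins positions i and j whenever the substring between them is a dictionary word, and that every edge has unit weight. Because all edges point forward, a single left-to-right pass is enough to propagate the distances, much as a wave front would. The function name `segment_min_words` and the toy dictionary are hypothetical.

```python
def segment_min_words(text, dictionary):
    """Split text into the fewest dictionary words; return None if impossible."""
    n = len(text)
    # Longest dictionary word bounds how far an edge can reach.
    max_len = max((len(w) for w in dictionary), default=0)
    inf = float("inf")
    dist = [inf] * (n + 1)  # dist[i]: fewest words covering text[:i]
    prev = [-1] * (n + 1)   # prev[i]: start of the last word on that path
    dist[0] = 0

    # Single forward pass: edges only go from smaller to larger positions,
    # so dist[i] is final by the time position i is visited.
    for i in range(n):
        if dist[i] == inf:
            continue  # position i cannot be reached by any segmentation
        for j in range(i + 1, min(n, i + max_len) + 1):
            if text[i:j] in dictionary and dist[i] + 1 < dist[j]:
                dist[j] = dist[i] + 1
                prev[j] = i

    if dist[n] == inf:
        return None

    # Walk back along the stored predecessors to recover the words.
    words = []
    j = n
    while j > 0:
        i = prev[j]
        words.append(text[i:j])
        j = i
    return words[::-1]


toy_dictionary = {"ab", "abc", "cde", "d", "e"}  # hypothetical example data
result = segment_min_words("abcde", toy_dictionary)
```

For the string `abcde` and this toy dictionary, the sketch returns `["ab", "cde"]`. A greedy forward maximal matching pass would take `abc` first and then need `d` and `e`, giving three words instead of two, which is the kind of case where the minimum-word criterion differs from FMM.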
