Abstract

Traditional Vietnamese word segmentation methods perform poorly on ambiguous Vietnamese text, and the scarcity of Vietnamese corpora poses a further challenge to language processing. We first investigated state-of-the-art deep neural network methods. To address the ambiguity problem in Vietnamese word segmentation, we then proposed a segmentation technique based on an improved long short-term memory (LSTM) neural network, which consists of an LSTM encoding part and a CNN feature extraction part. Important earlier information is kept in the memory unit, which avoids the limitation of a fixed local context window. The word segmentation task is refined into a classification problem and a sequence labeling problem, so that useful character-level and word-level features are learned automatically. Finally, the method was validated on a self-built dataset crawled from Vietnamese news websites. The experimental results show that our proposed method improves clearly over single-LSTM, single-CNN, and traditional methods. On the Vietnamese word segmentation task, the accuracy reaches 96.6%, the recall reaches 95.2%, and the F1 value reaches 96.3%, which is significantly better than the traditional, CNN, and LSTM methods.

Highlights

  • In the field of linguistic information processing, there have been many studies on word segmentation. The research findings are divided into three types: dictionary-based word segmentation methods, statistics-based word segmentation methods, and understanding-based word segmentation methods. The dictionary-based word segmentation method matches the character string under study against the entries of a manually built machine dictionary according to a given strategy

  • Based on previous work, and combining Vietnamese word-formation and language features, we propose a model based on a long short-term memory (LSTM) neural network, in which the input, output, and forget gates determine how previous information is used to model and update the memory

  • In order to realize the Vietnamese word segmentation task, an improved LSTM neural network framework is proposed, and the Vietnamese word segmentation task is separated into a classification part and a sequence labeling part
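The sequence labeling view mentioned in the highlights can be illustrated with a minimal sketch. It assumes a two-tag B/I scheme over syllables (B opens a word, I continues it); the source does not state which tag set the authors use, and the function names `words_to_tags` and `tags_to_words` are hypothetical. A tagger such as the LSTM model would predict the tags; here they are written by hand.

```python
from typing import List


def words_to_tags(words: List[List[str]]) -> List[str]:
    """Turn a segmented sentence (each word is a list of syllables) into B/I tags."""
    tags: List[str] = []
    for word in words:
        tags.append("B")                      # first syllable opens a word
        tags.extend("I" for _ in word[1:])    # remaining syllables continue it
    return tags


def tags_to_words(syllables: List[str], tags: List[str]) -> List[str]:
    """Rebuild words from syllables and B/I tags, joining syllables with '_'."""
    words: List[List[str]] = []
    for syllable, tag in zip(syllables, tags):
        if tag == "B" or not words:
            words.append([syllable])          # start a new word
        else:
            words[-1].append(syllable)        # extend the current word
    return ["_".join(word) for word in words]


# Hand-written tags standing in for a model's predictions (assumed example).
syllables = ["học", "sinh", "học", "sinh", "học"]
tags = ["B", "I", "B", "B", "I"]
segmented = tags_to_words(syllables, tags)
print(segmented)  # ['học_sinh', 'học', 'sinh_học']
```

Framed this way, segmentation reduces to predicting one label per syllable, which is what lets a sequence model use context from the whole sentence rather than a fixed window.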


Summary

Introduction

Under the current dual promotion of economic globalization and artificial intelligence. The dictionary-based word segmentation method matches the character string under study against the entries of a manually built machine dictionary according to a given strategy. If the string matches an entry in the dictionary, it is segmented as a word. (i) First, we introduce related research in language processing and propose the study of Vietnamese word segmentation in response to the scarcity of Vietnamese corpora. (ii) Because traditional methods are not effective at resolving ambiguity in Vietnamese word segmentation, we study methods based on deep neural networks for Vietnamese word segmentation.
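To make the dictionary-based approach concrete, the following is a minimal sketch of one common matching strategy, forward maximum matching. The source does not say which strategy is meant, so the choice of strategy, the function name `forward_max_match`, the `max_len` parameter, and the toy lexicon are all assumptions for illustration.

```python
from typing import List, Set


def forward_max_match(syllables: List[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: take the longest lexicon match at each position."""
    words: List[str] = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first and shrink until a match is found.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or candidate in lexicon:
                words.append(candidate.replace(" ", "_"))
                i += n
                break
    return words


# Toy lexicon (assumed): "học sinh" (pupil) and "sinh học" (biology).
lexicon = {"học sinh", "sinh học"}
result = forward_max_match("học sinh học sinh học".split(), lexicon)
print(result)  # ['học_sinh', 'học_sinh', 'học']
```

On this input the greedy match yields "học_sinh học_sinh học", whereas the intended reading is "học_sinh học sinh_học". This is the kind of segmentation ambiguity that purely dictionary-driven strategies struggle with.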

Related Work
Vietnamese
Vietnamese Character
Neural Model for Vietnamese Word Segmentation
Improved LSTM Model of Vietnamese
Datasets
Training
Evaluation Metrics
Experimental Result
Conclusion
