Abstract

Effective word vector training yields semantically rich word vectors and better results on the same task. To address the shortcomings of the traditional skip-gram model in encoding and modeling context words, this study proposes an improved word vector training method based on the skip-gram algorithm. Building on an analysis of the existing skip-gram model, the distributional hypothesis is introduced: the distribution of each word over its contexts is taken as the word's representation, so that each word is embedded in a semantic space and then modeled, which enables better modeling through smoothing over words and their semantic space. During training, stochastic gradient descent is used to solve for the vector representation of each word and each Chinese character. In the experiments, the proposed method is compared with skip-gram, CWE+P, and SEING on a word similarity task and a text classification task. Experimental results show that the proposed method has significant advantages on the Chinese word segmentation task, with a performance gain rate of about 30%. The method proposed in this study provides a reference for further research on word vectors and text mining.
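The training procedure the abstract describes, solving for each vector by stochastic gradient descent over a skip-gram objective, can be sketched as follows. This is a minimal illustration with negative sampling; the toy corpus, dimensions, and hyperparameters are assumptions for demonstration, not values from the paper.

```python
import numpy as np

# Toy corpus and hyperparameters (illustrative assumptions only)
rng = np.random.default_rng(0)
corpus = [["我", "爱", "自然", "语言"], ["语言", "模型", "训练"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
dim, lr, window, neg_k = 8, 0.05, 2, 3

W_in = rng.normal(0, 0.1, (len(vocab), dim))   # target-word vectors
W_out = rng.normal(0, 0.1, (len(vocab), dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(50):
    for sent in corpus:
        for pos, word in enumerate(sent):
            t = idx[word]
            lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
            for cpos in range(lo, hi):
                if cpos == pos:
                    continue
                c = idx[sent[cpos]]
                # one positive context pair plus neg_k random negatives
                pairs = [(c, 1.0)] + [(int(rng.integers(len(vocab))), 0.0)
                                      for _ in range(neg_k)]
                for o, label in pairs:
                    score = sigmoid(W_in[t] @ W_out[o])
                    grad = score - label           # d(logistic loss)/d(dot)
                    g_in = grad * W_out[o]         # save before updating W_out
                    W_out[o] -= lr * grad * W_in[t]
                    W_in[t] -= lr * g_in
```

Each SGD step moves the target vector toward its observed context vector and away from sampled negatives, which is the standard skip-gram update the paper's method builds on.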

Highlights

  • Nowadays, pre-trained word vectors have become necessary modules for many natural language processing and machine learning tasks

  • To further demonstrate that the improved training method yields better word semantics than the original skip-gram word vector model, the main baselines in this experiment are the traditional skip-gram model and the CWE+P and SEING models, which were previously proposed as improvements on skip-gram

  • Performance gain refers to the relative increase in performance of a word vector over a random word vector on a task. The idea of the performance gain rate is that each word vector is compared only with the best word vector under the same conditions
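The performance gain rate defined in the highlights above can be written as a one-line computation. The function name and the example scores below are hypothetical illustrations, not results from the paper.

```python
def performance_gain_rate(model_score: float, random_score: float) -> float:
    """Relative increase of a word vector's task score over a random-vector baseline."""
    return (model_score - random_score) / random_score

# e.g. trained vectors scoring 0.65 on a task where random vectors score 0.50
gain = performance_gain_rate(0.65, 0.50)
print(f"{gain:.0%}")  # prints "30%"
```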



Introduction

Pre-trained word vectors have become necessary modules for many natural language processing and machine learning tasks. With the rise of deep learning in recent years, neural-network-based feature learning has brought new ideas to natural language processing [1]. Many researchers have devoted themselves to designing new word vector models or optimizing existing ones to improve performance. Neural network models based on word vectors have improved performance on multiple natural language processing tasks and have even achieved state-of-the-art results on several of them. Over the past two decades, research on Chinese word segmentation has produced rich results [2]. Because Chinese text is written as a continuous string drawn from a large character set, dictionary-based matching alone cannot solve the segmentation problem effectively.
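The limitation of dictionary-based matching noted above can be seen in a minimal sketch of forward maximum matching, a classic dictionary-based segmenter. The dictionary and example sentence are illustrative assumptions; the point is that greedy longest-match commits to a wrong word when the text is ambiguous.

```python
def forward_max_match(text: str, dictionary: set, max_len: int = 4) -> list:
    """Greedy longest-first dictionary segmentation (forward maximum matching)."""
    tokens, i = [], 0
    while i < len(text):
        # try the longest candidate first, falling back to a single character
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

dictionary = {"研究", "研究生", "生命", "命", "起源"}
# "研究生命起源" should segment as 研究 / 生命 / 起源 ("study the origin of life"),
# but greedy matching grabs the longer 研究生 ("graduate student") first.
print(forward_max_match("研究生命起源", dictionary))
# → ['研究生', '命', '起源']
```

Resolving such ambiguities requires context beyond the dictionary, which motivates statistical and word-vector-based approaches.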
