External Knowledge and Data Augmentation Enhanced Model for Chinese Short Text Matching

Haoyang Ma, Hongyu Guo

doi:10.1007/978-981-99-1645-0_7

Abstract

With the rapid growth of the internet, large amounts of short text data are generated in daily life. Chinese short text matching is an important natural language processing task, but it still faces challenges such as the ambiguity of Chinese words and an imbalanced ratio of samples in the training set. To address these problems, we propose an external knowledge and data augmentation enhanced model (EDM) for Chinese short text matching. EDM uses jieba, thulac and ltp to generate word lattices and employs HowNet as an external knowledge source for disambiguation. Additionally, a hybrid model combining dropout and EDA is adopted for data augmentation to balance the proportion of samples in the training set. Experimental results on two Chinese short text datasets show that EDM outperforms most existing models. Ablation experiments also demonstrate that external knowledge and data augmentation significantly improve the model.
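
As an illustration of the word lattice idea mentioned above, the following minimal sketch merges several candidate segmentations of one sentence into a set of character-span edges. It is an assumption about how such a lattice could be represented, not the authors' implementation: the function names `spans` and `build_lattice` are hypothetical, and the three segmentations are hand-written stand-ins for the outputs of different segmenters rather than real results from jieba, thulac or ltp.

```python
def spans(tokens):
    """Convert one segmentation (a list of words) into (start, end, word) character spans."""
    out, pos = [], 0
    for tok in tokens:
        out.append((pos, pos + len(tok), tok))
        pos += len(tok)
    return out


def build_lattice(sentence, segmentations):
    """Union the word spans of all segmentations into one sorted list of lattice edges."""
    edges = set()
    for seg in segmentations:
        # Each segmentation must reproduce the sentence exactly, otherwise offsets are meaningless.
        if "".join(seg) != sentence:
            raise ValueError("segmentation does not match the sentence")
        edges.update(spans(seg))
    return sorted(edges)


sentence = "南京市长江大桥"
# Hypothetical, hand-written segmentations standing in for three different segmenters.
segmentations = [
    ["南京市", "长江大桥"],
    ["南京", "市长", "江大桥"],
    ["南京市", "长江", "大桥"],
]

lattice = build_lattice(sentence, segmentations)
for start, end, word in lattice:
    print(start, end, word)
```

Each edge records a candidate word and the character positions it covers, so conflicting segmentations (for example "南京市" versus "南京" followed by "市长") coexist in the same structure and a downstream model can weigh them, for instance with the help of an external knowledge source.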

Full Text