Improving Utterance Rewriter Based on MMI and Text Data Augmentation

Lina Yang, Zuqiang Meng, Wei Li, Patrick Shen-Pei Wang, Huiwu Luo, Hai Lin, Xichun Li

doi:10.1142/s021800142259011x

Abstract

In multi-round dialogue tasks, maintaining the consistency of a model's answers is a major research challenge. Every answer the model gives should be time dependent, causal, and logical. To keep the model's personality, dialogue style, and context consistent, the key information in the historical dialogue must be retained as far as possible so that the model can generate more accurate answers. Utterance rewriting is a technique that supplements the current sentence with information recovered by analyzing the historical dialogue, thereby retaining that key information. This paper uses text augmentation, the Maximum Mutual Information (MMI) method, and a character correction method based on the Knuth–Morris–Pratt (KMP) algorithm to improve the quality of generated utterance rewrites. Existing utterance rewriting datasets are limited in size, and building them manually is too costly. Text data augmentation based on coreference resolution is used to supply the positive samples missing from the utterance rewriting dataset and, at the same time, to expand the existing datasets. The generated results are then optimized with the MMI method, and the KMP-based character correction method is used to fix wrong characters and improve overall accuracy.
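
As background for the KMP-based character correction mentioned above, the following is a minimal sketch of standard Knuth–Morris–Pratt substring search. It shows only the textbook matching algorithm; how the paper applies it to character correction is not described in the abstract, so the function names `build_failure_table` and `kmp_find_all` are illustrative assumptions and not the authors' implementation.

```python
from typing import List


def build_failure_table(pattern: str) -> List[int]:
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until the next character matches.
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find_all(text: str, pattern: str) -> List[int]:
    """Return the start index of every (possibly overlapping)
    occurrence of pattern in text. Hypothetical helper for illustration."""
    if not pattern:
        return []
    failure = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            # Continue from the longest border to allow overlaps.
            k = failure[k - 1]
    return matches
```

Because the failure table lets the search reuse already-matched characters instead of rescanning the text, matching runs in time linear in the combined length of the text and the pattern. One plausible use, stated here only as an assumption, is locating spans of the dialogue history inside a generated rewrite so that mismatched characters can be identified.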

Full Text