Abstract

Automatic Grammatical Error Correction (GEC) detects and corrects various types of syntactic, spelling, and grammatical errors. Different approaches, such as rule-based, Statistical Machine Translation (SMT), and Neural Machine Translation (NMT), have been proposed. Among these, NMT based on the seq2seq Transformer with multi-head attention performs best. A key shortcoming of seq2seq GEC models with multiple encoder-decoder layers is that only the top layer's representation is exploited in subsequent processing. In addition, because of the exposure bias problem, at inference time the model conditions on its own previously generated words rather than the gold target words, which can degrade output quality. This paper proposes a GEC model based on the seq2seq Transformer for low-resource languages such as Arabic to address these issues. First, we propose a noising method for constructing synthetic parallel data to overcome the bottleneck arising from the lack of training corpora. Furthermore, motivated by the success of capsule networks in computer vision, we use the Expectation-Maximization routing algorithm to dynamically aggregate information across layers for Arabic GEC. Moreover, to mitigate the exposure bias problem, we introduce a bidirectional regularization term based on Kullback-Leibler divergence into the training objective to improve the agreement between right-to-left and left-to-right models. Experiments on the QALB-2014 and QALB-2015 benchmarks show that our proposed model achieves the best F1 score compared to existing Arabic GEC systems.
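The abstract does not state the exact form of the regularized training objective; the following is a minimal sketch of one plausible formulation, in which the weight $\lambda$, the cross-entropy losses, and the symmetrized divergence are our assumptions rather than the paper's stated equation:

$$
\mathcal{L}(\theta_{\rightarrow}, \theta_{\leftarrow})
= \mathcal{L}_{\mathrm{CE}}(\theta_{\rightarrow})
+ \mathcal{L}_{\mathrm{CE}}(\theta_{\leftarrow})
+ \lambda \left[ D_{\mathrm{KL}}\!\left(P_{\rightarrow}(y \mid x) \,\|\, P_{\leftarrow}(y \mid x)\right)
+ D_{\mathrm{KL}}\!\left(P_{\leftarrow}(y \mid x) \,\|\, P_{\rightarrow}(y \mid x)\right) \right]
$$

where $P_{\rightarrow}$ and $P_{\leftarrow}$ denote the output distributions of the left-to-right and right-to-left decoders. Minimizing the symmetric KL terms pushes the two decoding directions to agree, which matches the stated goal of the bidirectional regularizer.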
