Abstract

The Chinese diachronic gap is a key challenge in classical Chinese machine reading comprehension (CCMRC). Prior work on bridging this gap has been largely restricted to pre-training on limited monolingual classical Chinese corpora and integrating lexical knowledge, both of which demand substantial human effort. In this paper, we propose a cross-guidance cross-lingual model (CGCLM), pre-trained on a classical-modern Chinese parallel corpus generated by a large language model, to bridge the diachronic gap while reducing manual effort. CGCLM elicits accurate translations by providing the large language model with in-context examples and with feedback based on the longest common substring between source and target sentences, thereby avoiding untranslated classical Chinese words. Specifically, we consider three pre-training tasks: cross-masked language modeling, linguistic label cross-prediction, and semantic cross-aware translation language modeling. The knowledge acquired from uncovering masked tokens and predicting linguistic labels yields an implicit semantic alignment between the two language styles, while cross-aware modeling exploits the semantic similarity between corresponding syntactic levels of parallel pairs to integrate and transmit contextualized semantic information. We use an 18.6 GB monolingual corpus to create a 37.2 GB parallel corpus; manual evaluation finds only acceptable discrepancies between our generated corpus and a human-edited parallel corpus. Extensive experimental results show that the proposed model outperforms the state of the art by average accuracy margins of 3.13%, 2.44%, and 2.17% on CCMRC, classical Chinese language understanding evaluation (CCLUE), and modern Chinese language understanding evaluation (MCLUE) tasks, respectively.
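
The longest-common-substring feedback mechanism can be made concrete with a small sketch. The paper's implementation is not reproduced here, so the character-level dynamic-programming routine and the overlap threshold below are illustrative assumptions: the idea is simply that if a long span of the classical source survives verbatim in the candidate modern translation, some words were likely copied rather than translated, and the sentence can be sent back to the LLM with corrective feedback.

    # A minimal sketch (not the authors' implementation) of the longest-common-
    # substring check described in the abstract. The threshold `max_overlap`
    # is an illustrative assumption; the paper does not specify one.

    def longest_common_substring(source: str, target: str) -> str:
        """Return the longest contiguous substring shared by source and target."""
        m, n = len(source), len(target)
        # dp[i][j] = length of the common suffix of source[:i] and target[:j]
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        best_len, best_end = 0, 0
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if source[i - 1] == target[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                    if dp[i][j] > best_len:
                        best_len, best_end = dp[i][j], i
        return source[best_end - best_len:best_end]

    def needs_retranslation(source: str, target: str, max_overlap: int = 4) -> bool:
        """Flag a candidate translation that copies a suspiciously long span
        of the classical source verbatim, i.e. leaves it untranslated."""
        return len(longest_common_substring(source, target)) > max_overlap

    # Example: the first clause of the classical source is copied verbatim,
    # so the candidate is flagged for retranslation with feedback.
    src  = "学而时习之，不亦说乎"          # classical source
    bad  = "学而时习之，不是很快乐吗"      # clause left untranslated -> True
    good = "学了又按时温习，不也很愉快吗"  # fully translated         -> False
    print(needs_retranslation(src, bad), needs_retranslation(src, good))

Under this reading, a low shared-span length is treated as evidence that the classical wording was actually paraphrased into modern Chinese, which is what the in-context examples and feedback loop are meant to enforce.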
