Abstract
Chinese diachronic gap is a key issue in classical Chinese machine reading comprehension (CCMRC). Preceding work on bridging this gap has been mostly restricted to limited monolingual classical Chinese corpora pre-training and lexical knowledge integration, which require a great deal of human resources. In this paper, we propose a cross-guidance cross-lingual model (CGCLM), pre-trained on a classical and modern Chinese parallel corpus generated from a large language model, to bridge the Chinese diachronic gap and reduce the manual effort. The CGCLM facilitates accurate translation by providing in-context examples and feedback based on the longest common substring between source and target sentences, thereby avoiding untranslated Chinese words. Specifically, we consider three pre-training tasks, i.e., cross-masked language modeling, linguistic label cross-prediction, and semantic cross-aware translation language modeling. The knowledge acquired from masked tokens uncovering and linguistic label predicting can lead to the implicit semantic alignment between two language styles. Taking advantage of the semantic similarity between the same syntactic levels of parallel pairs, cross-aware modeling integrates and transmits contextualized semantic information. We utilize an 18.6G monolingual corpus to create a 37.2G parallel corpus. Manual evaluation has resulted in only acceptable discrepancies between our generated and human-edited parallel corpora. Extensive experimental results show that our proposed model outperforms the state-of-the-art by an average accuracy of 3.13%, 2.44%, and 2.17% on CCMRC, classical Chinese language understanding evaluation (CCLUE), and modern Chinese language understanding evaluation (MCLUE) tasks.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.