Abstract

This paper focuses on understanding ancient Chinese Buddhist literature. Buddhist literature incorporates a plethora of dialects and slang, making it challenging to extract semantic meaning. To address this issue, a Generative Adversarial Network Masking Model (GAN-MM) is proposed to pre-train BERT models. The method optimizes the Masked Language Model (MLM) objective, exploiting the rich semantics of Buddhist terminology as well as the absence of function words in Buddhist literature. Furthermore, a semi-supervised learning algorithm is developed to train the GAN-MM. The model is evaluated on two private tasks related to Buddhist literature understanding, sentiment classification and text segmentation, as well as two public tasks pertaining to ancient Chinese understanding. Experimental results demonstrate that GAN-MM significantly improves BERT pre-training compared with conventional MLM methods. A large-scale Buddhist dataset, comprising 20,075 utterance documents and 146 million tokens, is publicly released at https://data.mendeley.com/datasets/5hzs8w46jh/1.
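The abstract does not specify GAN-MM's internals, so the following is only an illustrative sketch of the general idea it names: an adversarial generator choosing which tokens to mask, with the MLM as its opponent. Every name below (MaskGenerator, TinyMLM, the REINFORCE-style update) is an assumption for illustration, not the authors' method.

```python
# Purely illustrative sketch of adversarial mask selection for MLM
# pre-training. Not the paper's GAN-MM: architecture, losses, and the
# update rule here are all assumptions made for the example.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID = 1000, 1

class MaskGenerator(nn.Module):
    """Scores token positions; high-scoring positions get masked."""
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, ids):
        return self.score(self.embed(ids)).squeeze(-1)  # (B, T)

class TinyMLM(nn.Module):
    """Stand-in for BERT: a one-layer encoder with an MLM head."""
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, ids):
        return self.head(self.encoder(self.embed(ids)))  # (B, T, VOCAB)

def step(ids, gen, mlm, gen_opt, mlm_opt, ratio=0.15):
    B, T = ids.shape
    k = max(1, int(ratio * T))
    pos = gen(ids).topk(k, dim=1).indices          # positions to mask
    masked = ids.scatter(1, pos, MASK_ID)

    # MLM step: minimize cross-entropy at the masked positions only.
    pred = mlm(masked).gather(1, pos.unsqueeze(-1).expand(-1, -1, VOCAB))
    target = ids.gather(1, pos)
    mlm_loss = F.cross_entropy(pred.reshape(-1, VOCAB), target.reshape(-1))
    mlm_opt.zero_grad(); mlm_loss.backward(); mlm_opt.step()

    # Generator step: hard top-k selection is non-differentiable, so use
    # a REINFORCE-style update with the MLM loss as reward, pushing the
    # generator toward positions the MLM finds hard (the minimax game).
    log_prob = F.log_softmax(gen(ids), dim=1).gather(1, pos).sum(1)
    gen_loss = -(mlm_loss.detach() * log_prob).mean()
    gen_opt.zero_grad(); gen_loss.backward(); gen_opt.step()
    return mlm_loss.item()

gen, mlm = MaskGenerator(), TinyMLM()
g_opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
m_opt = torch.optim.Adam(mlm.parameters(), lr=1e-3)
batch = torch.randint(2, VOCAB, (8, 32))           # toy token ids
print(step(batch, gen, mlm, g_opt, m_opt))
```

The score-function update is only one plausible way to train a discrete masking policy; a Gumbel-softmax relaxation or sampling-based masking would serve equally well in a sketch like this.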
