To address the complex linguistic structure and lexical polysemy of ancient texts, this study proposes a two-stage translation model. In the first stage, we combine GujiBERT, GCN, and LSTM to classify ancient texts as historical or non-historical, which lays the foundation for the subsequent translation task. To overcome the limitations of the traditional Word2Vec model, we integrate the entropy weight method into the skip-gram training process and concatenate the resulting word vectors with those from GujiBERT. This method improves the efficiency of word vector generation and, through dependency weighting, enhances the model’s ability to represent lexical polysemy and grammatical structure in ancient documents. In the second stage, we train the translation model on a separate dataset for each text category, which significantly improves translation accuracy. Experimental results show that our classification model improves accuracy by 5% over GujiBERT, and that Entropy-SkipBERT improves BLEU scores by 0.7 and 0.4 on the historical and non-historical datasets, respectively. Overall, the proposed two-stage model improves the BLEU score by 2.7 over the baseline model.
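
The abstract does not say how the entropy weights enter the word-vector training, so the following is only a minimal sketch of the generic entropy weight method, not the authors' implementation. The function name `entropy_weights` and the input layout (rows as samples, columns as non-negative indicators) are assumptions made for illustration.

```python
import math


def entropy_weights(matrix):
    """Generic entropy weight method (illustrative, hypothetical helper).

    matrix: list of rows; each row holds non-negative indicator values.
    Returns one weight per column; the weights sum to 1.
    """
    n = len(matrix)
    if n < 2:
        raise ValueError("need at least two rows to compute entropy")
    m = len(matrix[0])

    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        if total <= 0:
            raise ValueError("each column needs a positive sum")
        # Shannon entropy of the column's proportions, scaled to [0, 1].
        entropy = 0.0
        for x in column:
            if x > 0:  # treat 0 * log(0) as 0
                p = x / total
                entropy -= p * math.log(p)
        entropy /= math.log(n)
        # Low entropy means the column discriminates more, so it gets more weight.
        divergences.append(max(0.0, 1.0 - entropy))

    total_div = sum(divergences)
    if total_div < 1e-12:
        # All columns are uniform: fall back to equal weights.
        return [1.0 / m] * m
    return [d / total_div for d in divergences]


# Assumed toy data: column 0 is constant, column 1 varies across rows.
weights = entropy_weights([[1, 1], [1, 3], [1, 5]])
```

In this toy input the constant column carries no information, so nearly all of the weight falls on the second column.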