Abstract

Cross-lingual sentence similarity computation is a central research topic in natural language processing (NLP). Existing work has introduced fine-grained word- and character-level features to help models understand sentence meaning, but it does not consider coarse-grained prior knowledge at the sentence level. Moreover, even when two cross-lingual sentences share the same meaning, the sentence representations extracted by baseline approaches may carry language-specific biases. To address these problems, we construct a Chinese–Uyghur cross-lingual sentence similarity dataset and propose a method that computes cross-lingual sentence similarity by fusing multiple features. The method is built on the cross-lingual pretrained model XLM-RoBERTa and supports similarity computation by introducing two coarse-grained prior knowledge features: sentence sentiment and sentence length. To eliminate possible language-specific biases in the vectors, we whiten the sentence vectors of the different languages so that they are all represented under a standard orthogonal basis. Because different vector combinations affect the final performance of the model differently, we conduct comparison experiments with different vector features on top of the basic feature concatenation scheme. The results show that the absolute-difference feature of the two sentence vectors reflects the similarity of two sentences well. The final F1 value of our method reaches 98.97%, which is 19.81% higher than that of the baseline.
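To make the vector-level operations concrete, the following is a minimal sketch (not the authors' exact implementation) of the whitening step and the feature fusion described above. The function names and the coarse-grained prior vector "prior" are illustrative assumptions, and the sentence embeddings are assumed to be pooled outputs of an XLM-RoBERTa encoder.

import numpy as np

def whitening_transform(embeddings):
    # embeddings: (n_sentences, dim) matrix of sentence vectors from both
    # languages (illustrative: pooled XLM-RoBERTa outputs).
    # Returns a mean and a whitening matrix so that transformed vectors have
    # zero mean and identity covariance, i.e. they are expressed under a
    # standard orthogonal basis.
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov((embeddings - mu).T)
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s))
    return mu, w

def whiten(x, mu, w):
    # Apply the whitening transform to a sentence vector (or a batch of them).
    return (x - mu) @ w

def fuse_features(u_vec, v_vec, prior):
    # Concatenate the two whitened sentence vectors, their element-wise
    # absolute difference (the feature the abstract reports as most useful),
    # and coarse-grained prior features such as sentiment and length
    # ("prior" is a hypothetical placeholder for those features).
    return np.concatenate([u_vec, v_vec, np.abs(u_vec - v_vec), prior])

In use, one would fit the whitening transform on sentence vectors from both languages, whiten each sentence of a pair, and feed the fused feature vector to a similarity classifier.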
