Sentence Embedding and Convolutional Neural Network for Semantic Textual Similarity Detection in Arabic Language

Adnen Mahmoud, Mounir Zrigui

doi:10.1007/s13369-019-04039-7

Abstract

The continuous and extraordinary growth of textual sources on the web has made paraphrasing easier. Detecting paraphrase has become a challenge in various natural language processing applications (e.g., plagiarism detection, information retrieval and extraction, and question answering). In contrast to Western languages such as English, few works have addressed the problem of extrinsic paraphrase detection in Arabic. In this context, we proposed a deep learning-based approach to indicate the extent to which original and suspect documents express the same meaning. First, the word2vec algorithm extracted relevant features by relating each word to its neighbors through a prediction task. Averaging the resulting word vectors was then an efficient way to generate sentence vector representations. Next, a convolutional neural network was used to capture more contextual information and compute the degree of semantic relatedness. Given the lack of publicly available resources, a paraphrased corpus was developed using the skip-gram model, which performed better at replacing an original word with its most similar word of the same grammatical class from the vocabulary. Finally, the proposed system detected contextual relationships between Arabic documents effectively, achieving better precision (85%) and recall (86.8%) than previous studies.
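The sentence-representation step described above can be illustrated with a minimal sketch. It assumes toy 3-dimensional vectors in place of trained word2vec embeddings, uses English placeholder tokens instead of Arabic words, and compares the two sentence vectors with plain cosine similarity as a stand-in for the convolutional neural network that the paper uses to score relatedness. The names `TOY_VECTORS`, `sentence_vector`, and `cosine` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy embeddings; a real system would load trained word2vec vectors.
TOY_VECTORS = {
    "student": [1.0, 0.0, 0.0],
    "pupil": [0.9, 0.1, 0.0],
    "reads": [0.0, 1.0, 0.0],
    "book": [0.0, 0.0, 1.0],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of the tokens found in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token of the sentence is in the vocabulary")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity, used here only as a simple stand-in for the CNN scorer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


original = sentence_vector(["student", "reads", "book"], TOY_VECTORS)
suspect = sentence_vector(["pupil", "reads", "book"], TOY_VECTORS)
similarity = cosine(original, suspect)
print(f"similarity between original and suspect: {similarity:.3f}")
```

In this sketch, out-of-vocabulary tokens are simply skipped before averaging, which is one common choice and not necessarily the one made in the paper.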

Full Text