REWRITING IDENTIFICATION TECHNOLOGY FOR TEXT CONTENT BASED ON MACHINE LEARNING METHODS

N Kholodna,V Vysotska

doi:10.15588/1607-3274-2022-4-11

Abstract

Context. Paraphrased textual content or rewriting is one of the difficult problems of detecting academic plagiarism. Most plagiarism detection systems are designed to detect common words, sequences of linguistic units, and minor changes, but are unable to detect significant semantic and structural changes. Therefore, most cases of plagiarism using paraphrasing remain unnoticed. Objective of the study is to develop a technology for detecting paraphrasing in text based on a classification model and machine learning methods through the use of Siamese neural network based on recurrent and Transformer type – RoBERTa to analyze the level of similarity of sentences of text content. Method. For this study, the following semantic similarity metrics or indicators were chosen as features: Jacquard coefficient for shared N-grams, cosine distance between vector representations of sentences, Word Mover’s Distance, distances according to WordNet dictionaries, prediction of two ML models: Siamese neural network based on recurrent and Transformer type - RoBERTa. Results. An intelligent system for detecting paraphrasing in text based on a classification model and machine learning methods has been developed. The developed system uses the principle of model stacking and feature engineering. Additional features indicate the semantic affiliation of the sentences or the normalized number of common N-grams. An additional fine-tuned RoBERTa neural network (with additional fully connected layers) is less sensitive to pairs of sentences that are not paraphrases of each other. This specificity of the model may contribute to incorrect accusations of plagiarism or incorrect association of user-generated content. Additional features increase both the overall classification accuracy and the model’s sensitivity to pairs of sentences that are not paraphrases of each other. Conclusions. The created model shows excellent classification results on PAWS test data: precision – 93%, recall – 92%, F1score – 92%, accuracy – 92%. The results of the study showed that Transformer-type NNs can be successfully applied to detect paraphrasing in a pair of texts with fairly high accuracy without the need for additional feature generation.

Full Text