Deep Learning Approach to Detect Plagiarism in Sinhala Text

Tharuka Kasthuriarachchi, E Y A Charles

doi:10.1109/iciis47346.2019.9063299

Abstract

In simple terms, plagiarism is taking someone else's work or ideas and passing them off as one's own without giving credit to the original author. Because a vast number of documents are available on the Internet, plagiarism has become a severe problem in the digital world, and automatic plagiarism detection tools are needed. There is a fair amount of existing research on automated plagiarism detection, and several language-dependent and language-independent plagiarism detection tools are available. The lack of adequate research on Sinhala plagiarism detection and the inefficiency of existing plagiarism detection tools for Sinhala text motivated this research, which focuses on developing a deep learning approach for plagiarism detection in Sinhala documents. In natural language processing (NLP), word embedding has become a popular language modelling and feature learning technique. A word embedding model represents a word or a phrase as a vector of real numbers known as a word vector. A word vector captures the semantic and syntactic similarity of a word or phrase to other words in a corpus. Further, a sentence can be represented as a vector using the word vectors of the words in that sentence. This work proposes a method for Sinhala plagiarism detection that represents sentences as vectors and compares sentences based on these vectors. For plagiarised sentences, the similarity of the vectors would be high. As part of this study, a word embedding model is built using a deep learning neural network and a Sinhala text corpus; the word2vec algorithm and the publicly available UCSC_Sinhala_News corpus are used for this purpose. In the proposed model, a simple aggregation method is used to derive a sentence vector from the word vectors of the words in that sentence. Cosine similarity and soft-cosine similarity are used to quantify the similarity of sentence vectors. Sentence pairs with similarity scores higher than a threshold value are considered plagiarised. The proposed model was implemented and tested on a newly created data set and was found to detect plagiarism with an accuracy of 97%. The model was found to detect both direct copying and more sophisticated copying, such as replacing words with synonyms or changing the order of words in a sentence.
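
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the aggregation method, the threshold value, or how soft-cosine similarity is applied, so the sketch assumes mean pooling of word vectors, a threshold of 0.8, and a soft-cosine measure computed over bag-of-words counts with word-to-word similarities taken from the embeddings. The word vectors are hypothetical toy values, and English tokens stand in for Sinhala words.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional word vectors standing in for a trained word2vec
# model; English tokens are placeholders for Sinhala words.
WORD_VECTORS = {
    "student": [1.0, 0.0, 0.0],
    "pupil":   [0.9, 0.1, 0.0],   # close to "student" (synonym)
    "reads":   [0.0, 1.0, 0.0],
    "book":    [0.0, 0.0, 1.0],
    "rain":    [-1.0, 0.0, 0.2],
    "falls":   [0.0, -1.0, 0.1],
    "heavily": [0.0, 0.1, -1.0],
}

THRESHOLD = 0.8  # assumed value; the abstract does not state the threshold


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_vector(tokens):
    """Mean of the word vectors of in-vocabulary tokens (assumed aggregation)."""
    vectors = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not vectors:
        raise ValueError("no in-vocabulary tokens")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def soft_cosine(tokens_a, tokens_b):
    """Soft-cosine similarity over bag-of-words counts.

    Word-to-word similarity is the embedding cosine clipped at zero
    (an assumption made for this sketch).
    """
    counts_a = Counter(t for t in tokens_a if t in WORD_VECTORS)
    counts_b = Counter(t for t in tokens_b if t in WORD_VECTORS)

    def bilinear(x, y):
        return sum(
            max(0.0, cosine(WORD_VECTORS[wi], WORD_VECTORS[wj])) * ci * cj
            for wi, ci in x.items()
            for wj, cj in y.items()
        )

    denom = math.sqrt(bilinear(counts_a, counts_a) * bilinear(counts_b, counts_b))
    return bilinear(counts_a, counts_b) / denom if denom else 0.0


def is_plagiarised(tokens_a, tokens_b, threshold=THRESHOLD):
    """Flag a sentence pair whose sentence-vector cosine exceeds the threshold."""
    score = cosine(sentence_vector(tokens_a), sentence_vector(tokens_b))
    return score > threshold


original = ["student", "reads", "book"]
reworded = ["book", "pupil", "reads"]       # reordered, with a synonym swapped in
unrelated = ["rain", "falls", "heavily"]

print(is_plagiarised(original, reworded))   # pair with synonym and reordering
print(is_plagiarised(original, unrelated))  # unrelated pair
print(soft_cosine(original, reworded), soft_cosine(original, unrelated))
```

Mean pooling ignores word order, so a reordered sentence maps to the same vector as the original, and a synonym shifts the vector only slightly. This is consistent with the abstract's observation that reordering and synonym replacement are detected, although a real system would need a trained embedding model and a threshold tuned on labelled data.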

