Abstract

In simple terms, plagiarism is taking someone else's work or ideas and passing them off as one's own without giving credit to the original author. Owing to the vast number of documents available on the Internet, plagiarism has become a severe problem in the digital world, and an automatic plagiarism detection tool is therefore needed. A fair amount of research exists on automated plagiarism detection, and several language-dependent and language-independent plagiarism detection tools are available. The lack of adequate research on Sinhala plagiarism detection and the inefficiency of existing plagiarism detection tools for Sinhala text motivated this research, which focuses on developing a deep learning approach for plagiarism detection in Sinhala documents. In natural language processing (NLP), word embedding has become a popular language modelling and feature learning technique. A word embedding model represents a word or a phrase as a vector of real numbers known as a word vector. A word vector captures the semantic and syntactic similarity of a word or phrase to other words in a corpus. Further, a sentence can be represented as a vector using the word vectors of the words in that sentence. This research proposes a method for Sinhala plagiarism detection that represents sentences as vectors and compares sentences based on these vectors. For plagiarised sentences, the similarity of the vectors would be high. As part of this study, a word embedding model is built using a deep learning neural network and a Sinhala text corpus. The word2vec algorithm and the publicly available UCSC_Sinhala_News corpus are used for this purpose. In the proposed model, a simple aggregation method is used to derive a sentence vector from the word vectors of the words in that sentence.
The cosine similarity and soft-cosine similarity metrics are used to quantify the similarity of sentences represented as vectors. Sentence pairs with similarity scores higher than a threshold value are considered plagiarised. The proposed model was implemented and tested on a newly created data set and was found to detect plagiarism with an accuracy of 97%. The model detects both direct copying and more sophisticated copying, such as replacing words with synonyms or changing the order of words in a sentence.
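
The comparison step described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the toy three-dimensional embeddings, the English placeholder tokens, the choice of averaging as the "simple aggregation method", the clipped-cosine word similarity used inside the soft-cosine measure, and the threshold of 0.8 are all assumptions made for illustration. A real system would load word vectors trained with word2vec on a Sinhala corpus.

```python
import math
from collections import Counter

# Hypothetical toy embeddings (assumption): stand-ins for word2vec vectors.
# English tokens are used only for readability.
EMBEDDINGS = {
    "student": [0.9, 0.1, 0.0],
    "pupil":   [0.8, 0.2, 0.1],
    "wrote":   [0.1, 0.9, 0.0],
    "essay":   [0.0, 0.2, 0.9],
    "rain":    [-0.7, 0.1, -0.6],
    "fell":    [-0.5, -0.8, 0.2],
}

THRESHOLD = 0.8  # assumed value; the abstract does not state the threshold


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_vector(tokens):
    """Average the word vectors of known tokens (assumed aggregation)."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        raise ValueError("sentence contains no known words")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def soft_cosine(tokens_a, tokens_b):
    """Soft cosine over bag-of-words counts.

    Word-to-word similarity is taken as the cosine of the two word
    vectors, clipped at zero (an assumption for this sketch).
    """
    a = Counter(t for t in tokens_a if t in EMBEDDINGS)
    b = Counter(t for t in tokens_b if t in EMBEDDINGS)

    def word_sim(w1, w2):
        if w1 == w2:
            return 1.0
        return max(0.0, cosine(EMBEDDINGS[w1], EMBEDDINGS[w2]))

    def form(x, y):
        # Bilinear form x^T S y, where S holds the word similarities.
        return sum(cx * cy * word_sim(wx, wy)
                   for wx, cx in x.items() for wy, cy in y.items())

    denom = math.sqrt(form(a, a) * form(b, b))
    return form(a, b) / denom if denom else 0.0


def is_plagiarised(tokens_a, tokens_b, threshold=THRESHOLD):
    """Flag a sentence pair whose averaged-vector cosine exceeds the threshold."""
    score = cosine(sentence_vector(tokens_a), sentence_vector(tokens_b))
    return score > threshold


original = ["student", "wrote", "essay"]
reordered = ["essay", "student", "wrote"]   # word order changed
synonym = ["pupil", "wrote", "essay"]       # word replaced by a near-synonym
unrelated = ["rain", "fell"]

for candidate in (reordered, synonym, unrelated):
    print(candidate, is_plagiarised(original, candidate),
          round(soft_cosine(original, candidate), 3))
```

Because averaging ignores word order, a reordered sentence maps to essentially the same vector as the original, and a synonym with a nearby word vector shifts the average only slightly. This is consistent with the abstract's claim that reordering and synonym replacement can be detected, although the behaviour on real Sinhala text depends on the quality of the trained embeddings and on the chosen threshold.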
