Abstract

Textual similarity among documents often leads to copyright issues. Measuring similarity among documents manually is time-consuming and infeasible. In this paper, we propose a technique for measuring similarity at the sensed-lexicon level for documents written in the Punjabi language using the Gurumukhi script. 50 Punjabi document pairs were manually collected with the help of native Punjabi writers. The proposed technique consists of 4 major levels. Level 0 is the data collection phase. Level 1 consists of the noise removal and stop word removal sub-levels. In level 2, extracted tokens are stemmed and lemmatized, and synonyms are replaced based on part of speech tagging. Also in level 2, a vector space representation of each document is built and n-grams are generated from the documents. Extracted n-grams are weighted based on term frequency. In level 3, string-based token-level similarity indexes, namely the Jaccard Similarity Index (JSI), Cosine Similarity Index (CSI) and Levenshtein Distance Index (LDI), are applied to the weighted tokens. Human Intelligence Task (HIT) based rating is used to score the similarity among documents on a scale of 0 to 100. Results obtained from HIT based rating are compared with results obtained from the proposed technique under various combinations of pre-processing levels. Results reveal that, on the basis of majority voting, stop word removal combined with stemming and ‘noun’ based synonym replacement is the best combination when used with bi-gram tokens. Statistical analysis indicates a strong correlation between CSI and HIT based rating.
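The three level 3 indexes named above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, the inputs are assumed to be token lists that have already passed through levels 1 and 2, Jaccard is computed over the set of distinct n-grams, cosine over raw term-frequency counts, and the Levenshtein distance is converted to a similarity by dividing by the longer sequence length, which is only one possible normalization.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) for a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard(a, b):
    """Jaccard index over the sets of distinct items in a and b."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # assumption: two empty documents count as identical
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity between term-frequency vectors of a and b."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


# Placeholder tokens standing in for pre-processed Punjabi tokens.
doc_a = ngrams(["t1", "t2", "t3", "t4"], n=2)
doc_b = ngrams(["t1", "t2", "t3", "t5"], n=2)
scores = {
    "JSI": jaccard(doc_a, doc_b),
    "CSI": cosine(doc_a, doc_b),
    "LDI": levenshtein_similarity(doc_a, doc_b),
}
```

Multiplying each score by 100 would put it on the same 0 to 100 scale as the HIT based rating, which is one simple way the two could be compared.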

Highlights

  • Textual similarity among documents often leads to copyright issues

  • Similarity between Punjabi documents has been measured at the lexical level with different combinations of pre-processing techniques

  • These document pairs were passed through various pre-processing techniques such as stop word removal, stemming, and part of speech based synonym replacement with the help of IndoWordNet


Summary

Introduction

Textual similarity among documents often leads to copyright issues. Measuring similarity among documents manually is time-consuming and infeasible. We propose a technique for measuring similarity at the sensed-lexicon level for documents written in the Punjabi language using the Gurumukhi script. Human Intelligence Task (HIT) based rating is used to score the similarity among documents on a scale of 0 to 100. Results reveal that, on the basis of majority voting, stop word removal combined with stemming and ‘noun’ based synonym replacement is the best combination when used with bi-gram tokens. Document level similarity is identified at the sensed-lexicon level. These documents are written in the Punjabi language using the Gurumukhi script, which adds one more layer of complexity to this task. A lot of research has been carried out on measuring similarity among documents written in foreign languages, especially English.
