Abstract

This paper reports on experiments investigating the use of the syntactic structure of sentences, combined with the sentences' terms, for document similarity calculation. Each document's sentences were first converted into ordered Part of Speech (POS) tags, which were then fed into the Longest Common Subsequence (LCS) algorithm to determine the size and count of the LCSs found when comparing the documents sentence by sentence. In the first stage, these syntactic features were used as a structural representation of the document's text. The resulting tag strings not only represent the text but also reduce its size, which improves the efficiency of comparing the documents' representative strings using the LCS. A score is generated by computing a cumulative value based on the number of LCSs found. In the second stage, documents that score well in the first stage are subjected to further comparison using the actual words of the sentences (content), again sentence by sentence. An overall final score is generated as a measure of similarity using the common words (accumulated for the whole document) and the total number of LCSs from the first stage. Experiments were done on two different corpora. The results show the utility of the proposed procedure in calculating similarities between written documents. The overall discrimination power was maintained while the size of the documents was reduced by using only a tag-string representation of each document.
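The two-stage procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact scoring formula, so the function names (`lcs_length`, `structural_score`, `common_words`), the `min_len` threshold, and the choice to compare every sentence of one document against every sentence of the other are all assumptions made for illustration. Sentences are assumed to be already tagged as lists of (word, POS tag) pairs.

```python
from typing import List, Sequence, Tuple

# A sentence is assumed to be pre-tagged: a list of (word, POS tag) pairs.
Sentence = List[Tuple[str, str]]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the LCS of the two prefixes that end before x and y.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise keep the better result of dropping x or dropping y.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def structural_score(doc_a: List[Sentence], doc_b: List[Sentence],
                     min_len: int = 3) -> int:
    """Stage 1 (assumed form): count sentence pairs whose POS-tag LCS
    is at least min_len tags long."""
    count = 0
    for sent_a in doc_a:
        tags_a = [tag for _, tag in sent_a]
        for sent_b in doc_b:
            tags_b = [tag for _, tag in sent_b]
            if lcs_length(tags_a, tags_b) >= min_len:
                count += 1
    return count


def common_words(doc_a: List[Sentence], doc_b: List[Sentence]) -> int:
    """Stage 2 (assumed form): accumulate the number of distinct words
    shared by each sentence pair, ignoring case."""
    total = 0
    for sent_a in doc_a:
        words_a = {word.lower() for word, _ in sent_a}
        for sent_b in doc_b:
            words_b = {word.lower() for word, _ in sent_b}
            total += len(words_a & words_b)
    return total
```

Working on tag sequences rather than raw words keeps the LCS alphabet small, which is the size reduction the abstract refers to. How the stage 1 count and the stage 2 word count are combined into the final similarity measure is left open here, since the abstract does not specify it.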

Highlights

  • With the growth of the web and the emergence of digital libraries, document management, text analysis and similarity calculation have become important text processing techniques

  • The percentage is interesting when we look at the percentages for the wholly and partially Press Association (PA)-derived documents

  • For the non-PA-derived stories the procedure gives a slightly lower percentage of positive documents. This is interesting because there is no direct reuse of PA articles, yet the documents are still considered similar in content since they report on the same stories


Summary

Introduction

With the growth of the web and the emergence of digital libraries, document management, text analysis and similarity calculation have become important text processing techniques. Text similarity is an integral part of many such applications [22, 23, 30]. It is a common task shared among various applications, ranging from copy detection [16, 19], near-copy detection [16, 11] and plagiarism detection [9, 10, 12] to IR systems [1, 14] and computational biology [2, 4, 6, 21]. Many such applications [26, 27, 28, 29] employ a combination of techniques and apply to multidisciplinary fields [7, 8, 13].

