Pairwise document similarity measure based on present term set

Marzieh Oghbaie, Morteza Mohammadi Zanjireh

doi:10.1186/s40537-018-0163-2

Open Access

Journal: Journal of Big Data
Publication Date: Dec 1, 2018
Citations: 34
License type: open-access

Affiliation: Imam Khomeini International University

Abstract

Measuring pairwise document similarity is an essential operation in various text mining tasks. Most similarity measures judge the similarity between two documents based on the term weights and the information content that the two documents share. However, they are insufficient when several documents have an identical degree of similarity to a particular document. This paper introduces a novel text document similarity measure based on the term weights and the number of terms that appear in at least one of the two documents. The effectiveness of our measure is evaluated on two real-world document collections for a variety of text mining tasks, such as text document classification, clustering, and near-duplicate detection. The performance of our measure is compared with that of some popular measures. The experimental results show that our proposed similarity measure yields more accurate results.
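The tie problem described in the abstract can be illustrated with a small sketch. It is not the paper's measure: the abstract does not give the formula, so the code below only shows two documents that receive the same cosine similarity to a query while differing in the number of terms present in at least one of the two documents. The helper names (`cosine`, `present_in_either`) and the toy weights are assumptions made for illustration.

```python
import math
from typing import Dict

# A document is a sparse vector: term -> weight (for example TF-IDF).
# Terms missing from the dict are treated as absent (weight 0).
Doc = Dict[str, float]


def cosine(x: Doc, y: Doc) -> float:
    """Standard cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * y[t] for t, w in x.items() if t in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def present_in_either(x: Doc, y: Doc) -> int:
    """Number of terms with non-zero weight in at least one of the two documents."""
    px = {t for t, w in x.items() if w > 0}
    py = {t for t, w in y.items() if w > 0}
    return len(px | py)


# Hypothetical toy data: both candidates tie on cosine similarity to the query.
query = {"a": 1.0, "b": 1.0}
doc1 = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
doc2 = {"a": 1.0, "b": 1.0, "c": math.sqrt(2.0)}

print(cosine(query, doc1), cosine(query, doc2))
print(present_in_either(query, doc1), present_in_either(query, doc2))
```

Here the two cosine scores coincide, but the query shares its term set with four terms in total against `doc1` and only three against `doc2`. A measure that also accounts for the present term set has extra information with which to separate such ties.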

Highlights

In text mining, a similarity measure is the quintessential way to calculate the similarity between two text documents, and is widely used in various Machine Learning (ML) methods, including clustering and classification.
Based on a modified version of the preferable properties for a similarity measure, we propose a new similarity measure, called the pairwise document similarity measure (PDSM).
We provide the results of our experiments and compare the performance of PDSM with that of other similarity measures used in k Nearest Neighbors (kNN), K-means, and the shingling algorithm.
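The highlights name the shingling algorithm without describing it. As a rough sketch, a common form of shingling for near-duplicate detection breaks each document into overlapping word w-grams and compares the resulting sets with Jaccard resemblance. The function names, the whitespace tokenizer, and the shingle width `w=3` are assumptions for illustration; the paper's actual setup, and how PDSM is used within it, are not given in this text.

```python
from typing import Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, w: int = 3) -> Set[Shingle]:
    """Return the set of overlapping word w-grams (shingles) of a text."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a: str, b: str, w: int = 3) -> float:
    """Jaccard resemblance of the two shingle sets, in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


print(resemblance("the quick brown fox jumps", "the quick brown fox leaps"))
```

Pairs whose resemblance exceeds a chosen threshold would be flagged as near-duplicates; the threshold is a tuning choice and is not specified here.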

Summary

Introduction

A similarity (or distance) measure is the quintessential way to calculate the similarity between two text documents, and is widely used in various Machine Learning (ML) methods, including clustering and classification. ML methods help learn from enormous collections, known as big data [1, 2]. Among ML methods, classification and clustering help discover patterns and correlations and extract information from large-scale collections [1]. These two techniques benefit different information retrieval (IR) applications. Document clustering can be applied to the document collection to improve search speed, precision, and recall, or to the search results to present information more effectively to the user [3]. Document classification is used in vertical search engines [4] and sentiment detection [5].

Similar Papers

Experiments on keyword list generation by term distribution clustering for text classification
Wilson Fonda ... Ayu Purwarianti
01 Oct 2014

A new neutrosophic TF-IDF term weighting for text mining tasks: text classification use case
Mariem Bounabi ... Karim Elmoutaouakil
International Journal of Web Information Systems | VOL. 17
08 Apr 2021

PubMed-supported clinical term weighting approach for improving inter-patient similarity measure in diagnosis prediction.
Lawrence Wc Chan ... Tao Chan
BMC Medical Informatics and Decision Making | VOL. 15
02 Jun 2015

New similarity measures for single-valued neutrosophic sets with applications in pattern recognition and medical diagnosis problems
Jia Syuen Chai ... Bay Vo
Complex & Intelligent Systems | VOL. 7
07 Dec 2020
