A New Retrieval Model Based on TextTiling for Document Similarity Search

Xiao-Jun Wan, Yu-Xin Peng

doi:10.1007/s11390-005-0552-9

Abstract

Document similarity search finds documents similar to a given query document and returns them to the user as a ranked list. It is widely used in text and web systems such as digital libraries and search engines. Traditional retrieval models, including Okapi's BM25 model and Smart's vector space model with length normalization, can handle this problem to some extent by treating the query document as a long query. In practice, the Cosine measure is considered the best model for document similarity search because it measures the similarity between two documents well. In this paper, the performance of these models is compared quantitatively through experiments. Because the Cosine measure cannot reflect the structural similarity between documents, a new retrieval model based on TextTiling is proposed. The proposed model takes the subtopic structure of documents into account. It first splits each document into text segments with TextTiling and calculates the similarity of each pair of text segments drawn from the two documents. It then obtains the overall similarity between the documents by combining the segment-pair similarities with an optimal matching method. The experimental results show that: 1) the popular retrieval models (Okapi's BM25 model and Smart's vector space model with length normalization) do not perform well for document similarity search; 2) the proposed model based on TextTiling is effective and outperforms the other models, including the Cosine measure; 3) the methods chosen for the three components of the proposed model are appropriate.
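The three components described in the abstract (segmentation, segment-pair similarity, and optimal matching) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: blank-line paragraph splitting stands in for TextTiling, raw term-frequency vectors are used for the Cosine measure, a brute-force search over assignments stands in for an efficient optimal matching algorithm, and the final score is normalized by the larger segment count. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import permutations


def split_segments(text):
    # Assumption: paragraphs separated by blank lines stand in for
    # the subtopic segments that TextTiling would produce.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def term_vector(segment):
    # Raw term frequencies; the weighting scheme is an assumption.
    return Counter(re.findall(r"[a-z0-9]+", segment.lower()))


def cosine(u, v):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def optimal_matching_score(sim):
    # sim[i][j] is the similarity of segment i (shorter document) and
    # segment j (longer document). Brute force over all one-to-one
    # assignments; feasible only for a handful of segments.
    rows, cols = len(sim), len(sim[0])
    best = 0.0
    for perm in permutations(range(cols), rows):
        best = max(best, sum(sim[i][j] for i, j in enumerate(perm)))
    return best


def document_similarity(doc_a, doc_b):
    segs_a = [term_vector(s) for s in split_segments(doc_a)]
    segs_b = [term_vector(s) for s in split_segments(doc_b)]
    if not segs_a or not segs_b:
        return 0.0
    if len(segs_a) > len(segs_b):
        segs_a, segs_b = segs_b, segs_a
    sim = [[cosine(a, b) for b in segs_b] for a in segs_a]
    # Assumption: dividing by the larger segment count penalizes
    # segments that are left unmatched.
    return optimal_matching_score(sim) / len(segs_b)
```

In this sketch, two documents that contain the same paragraphs in a different order receive the same score as identical documents, because the matching pairs segments by content rather than by position. The brute-force search has factorial cost, so a polynomial-time assignment algorithm such as Kuhn-Munkres would be the practical choice for longer documents.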

Journal: Journal of Computer Science and Technology
Publication Date: Jul 1, 2005
Citations: 18
