Abstract

Similarity measures are an essential component of information retrieval, document clustering, text summarization, and question answering, among other tasks. In this paper, we introduce a general framework of syntactic similarity measures for matching short text. We analyze the measures thoroughly by decomposing them into three components: character-level similarity, string segmentation, and matching technique. Soft variants of the measures are also introduced. With the help of two existing toolkits (SecondString and SimMetric), we provide an open-source Java toolkit implementing the proposed framework, which integrates the individual components so that entirely new combinations can be created. Experimental results reveal that the performance of the similarity measures depends on the type of dataset. For well-maintained datasets, using a token-level measure is important, but the basic (crisp) variant is usually sufficient. For uncontrolled datasets where typing errors are expected, the soft variants of the token-level measures are necessary. Among all tested measures, a soft token-level measure that combines set matching with q-grams at the character level performs best. A gap between human perception and syntactic measures remains due to the lack of semantic analysis.
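To make the best-performing combination concrete, the sketch below illustrates one plausible instance of a soft token-level measure: tokens from two strings are set-matched, and two tokens count as a match when the Jaccard similarity of their character-level q-gram sets reaches a threshold. This is a hypothetical illustration, not the toolkit's actual implementation; the class name, padding scheme, and threshold are assumptions.

```java
import java.util.*;

// Hypothetical sketch of a soft token-level measure combining set matching
// with character-level q-grams (not the paper's exact implementation).
public class SoftQGramJaccard {

    // Character-level q-grams of a token, with '#' boundary padding (assumed).
    static Set<String> qgrams(String s, int q) {
        Set<String> grams = new HashSet<>();
        String padded = "#".repeat(q - 1) + s + "#".repeat(q - 1);
        for (int i = 0; i + q <= padded.length(); i++) {
            grams.add(padded.substring(i, i + q));
        }
        return grams;
    }

    // Character-level similarity: Jaccard over the two q-gram sets.
    static double qgramSim(String a, String b, int q) {
        Set<String> ga = qgrams(a, q), gb = qgrams(b, q);
        Set<String> inter = new HashSet<>(ga);
        inter.retainAll(gb);
        Set<String> union = new HashSet<>(ga);
        union.addAll(gb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Soft set matching: a token pair counts as shared when its q-gram
    // similarity reaches the threshold; each token matches at most once.
    static double softJaccard(String s, String t, int q, double threshold) {
        String[] sTok = s.toLowerCase().split("\\s+");
        List<String> tTok = new ArrayList<>(Arrays.asList(t.toLowerCase().split("\\s+")));
        int tSize = tTok.size();
        int matches = 0;
        for (String a : sTok) {
            for (Iterator<String> it = tTok.iterator(); it.hasNext(); ) {
                if (qgramSim(a, it.next(), q) >= threshold) {
                    matches++;
                    it.remove(); // enforce one-to-one matching
                    break;
                }
            }
        }
        int unionSize = sTok.length + tSize - matches;
        return unionSize == 0 ? 1.0 : (double) matches / unionSize;
    }

    public static void main(String[] args) {
        // Typos such as "jon"/"john" still match at the q-gram level,
        // which a crisp token-level Jaccard would score as 0.
        System.out.println(softJaccard("jon smith", "john smtih", 2, 0.3));
        System.out.println(softJaccard("jon smith", "mary jones", 2, 0.3));
    }
}
```

With the crisp variant, "jon smith" and "john smtih" share no identical tokens and would score 0; the soft variant above scores them highly, which mirrors why the soft measures help on uncontrolled, typo-prone datasets.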
