Measurement of Text Similarity: A Survey

Jiapeng Wang, Yihong Dong

doi:10.3390/info11090421

Abstract

Text similarity measurement underlies many natural language processing tasks and plays an important role in information retrieval, automatic question answering, machine translation, dialogue systems, and document matching. This paper systematically reviews the state of research on similarity measurement, analyzes the advantages and disadvantages of current methods, develops a more comprehensive classification system for text similarity measurement algorithms, and summarizes future development directions. To provide a reference for related research and applications, text similarity measurement methods are described from two aspects: text distance and text representation. Text distance is divided into length distance, distribution distance, and semantic distance; text representation is divided into string-based, corpus-based, single-semantic text, multi-semantic text, and graph-structure-based representation. Finally, the development of text similarity is summarized in the discussion section.

Highlights

From the point of view of information theory [1], similarity is defined as the commonness between two text snippets.
Text similarity is fast becoming a key instrument in many natural language processing (NLP) tasks, such as information retrieval [2], automatic question answering [3], machine translation [4], dialogue systems [5], and document matching [6].
The classical and new algorithms are systematically expounded and compared. These results suggest that there is an association between text distance and text representation.

Summary

Introduction

From the point of view of information theory [1], similarity is defined as the commonness between two text snippets. Text similarity is fast becoming a key instrument in many natural language processing (NLP) tasks, such as information retrieval [2], automatic question answering [3], machine translation [4], dialogue systems [5], and document matching [6]. Most scholars classify text similarity measurement methods according to whether they rely on statistics, on corpora, or on knowledge bases such as Wikipedia [7]. This classification considers only how the text is represented and ignores how the distance between texts is calculated. With the development of neural network representation learning, some semantic matching methods and graph methods also need to be considered.
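The point that text distance and text representation are separate choices can be illustrated with a minimal sketch. It holds the representation fixed (a bag-of-words count vector, a simple string-based representation) and swaps only the distance calculation. The specific measures are assumptions chosen for illustration and are not named in this section: cosine distance stands in for a length-style distance, and Jensen-Shannon divergence for a distribution-style distance. The function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as lowercase word counts (illustrative representation)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Length-style distance (assumed example): 1 - cosine similarity."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def js_divergence(a, b):
    """Distribution-style distance (assumed example): Jensen-Shannon
    divergence with base-2 logarithms, so the value lies in [0, 1]."""
    total_a, total_b = sum(a.values()), sum(b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both texts must contain at least one word")
    div = 0.0
    for w in a.keys() | b.keys():
        p = a[w] / total_a  # word probability in the first text
        q = b[w] / total_b  # word probability in the second text
        m = (p + q) / 2     # probability under the mixture distribution
        if p > 0:
            div += 0.5 * p * math.log2(p / m)
        if q > 0:
            div += 0.5 * q * math.log2(q / m)
    return div


if __name__ == "__main__":
    x = bag_of_words("the cat sat on the mat")
    y = bag_of_words("the cat lay on the rug")
    print("cosine distance:", cosine_distance(x, y))
    print("JS divergence:  ", js_divergence(x, y))
```

Both functions take the same representation as input, so a classification based on representation alone cannot tell them apart. The two values are on different scales and should not be compared with each other directly.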

Objectives

Discussion

Conclusion