Abstract

Measuring similarities between strings is central to many established and fast-growing research areas, including information retrieval, biology, and natural-language processing. The traditional approach to string similarity measurement is to define a metric over a word space that quantifies and sums up the differences between the characters of two strings; surprisingly, these metrics have not evolved a great deal over the past few decades. Indeed, the majority of them still rely on simple comparisons of characters and character distributions, without considering the words' context. This paper proposes a string metric that captures similarities between strings based on (1) the character similarities between the words, including non-standard and standard spellings of the same words, and (2) the context of these words. We propose a neural network composed of a denoising autoencoder and what we call a context encoder, both specifically designed to find similarities between words based on their context. Experimental results show that the resulting metrics find the correct version of a non-standard spelling among the closest words in 85.4% of cases, compared to 63.2% for the established Normalised-Levenshtein distance. We also show that, with our approach, words used in similar contexts are calculated to be more similar than words used in different contexts, which is a desirable property lacking in established string metrics.
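For readers unfamiliar with the baseline named above, the following is a minimal sketch of a normalised Levenshtein distance. The abstract does not say which normalisation the authors use; dividing the edit distance by the length of the longer string is an assumption made here for illustration, and the function names `levenshtein` and `normalised_levenshtein` are hypothetical, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalised_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1].

    Assumption: normalise by the length of the longer string. Other
    conventions exist, and the paper may use a different one.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(a, b) / longest
```

Because such a measure looks only at characters, a call like `normalised_levenshtein("cat", "car")` returns the same value no matter how the two words are used in text, which is the limitation the proposed context-aware metric is meant to address.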
