Abstract

Ranking functions from information retrieval are used primarily in search engines and are often adopted for various language processing applications. This paper introduces novel heuristics that are combined with probabilistic retrieval functions and applied to the approximate string similarity problem. Various algorithms have been proposed in the literature to solve approximate string similarity problems; however, none of them makes use of probabilistic retrieval functions. We are the first to explore the intersection between these two areas, string similarity and information retrieval, and we propose heuristic designs to address this problem. First, we propose a chunking heuristic function called BREAK. We present the variants BREAK-1, BREAK-2, and BREAK-OFF, which split terms up in a sequential manner. Then we propose BREAK-n, which generalizes these variants and scales to larger datasets. To relate these split-ups, we propose MAKE, a graphical error-modelling heuristic over the BREAK variants. Finally, we propose the TAKE curve, a novel feature-engineering probability distribution that replaces the prevalent normalization heuristics. Taking advantage of the flexibility in the choice of heuristics, we assess the variants on cognate detection, mutant identification, and isolated spelling correction problems. In an extensive evaluation, we found that our designs perform better than the prevalent heuristics and are robust against database characteristics.

