Deep String Matching For Duplicate Detection

Alexandre Bloch, Daniel Alexandre Bloch

doi:10.2139/ssrn.3847416


Journal: SSRN

Publication Date: May 16, 2021

Affiliation: University of Edinburgh

#String Matching Problem #Deep Recurrent Network

Abstract

We consider duplicate detection in the setting where typographical errors, toponym matching, and datatype dependency must all be handled within a single task. We cast this task as a string matching problem and solve it by estimating a conditional probability with an encoder-decoder model: each string is first encoded by a Deep Recurrent Network into a context vector, and the context vectors are then concatenated and used as input to a Deep Classifier Network. We explore how different architectures affect the string matching problem when applied to duplicate detection. Finally, we test the models on numerous datasets of varying size, some of which focus more on one of the datatype issues than others.
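
The pipeline described in the abstract (encode each string with a recurrent network, concatenate the context vectors, classify the pair) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The toy vocabulary `VOCAB`, the hidden size `HIDDEN`, the single-layer Elman cell, the sharing of one encoder across both strings, and all function names are assumptions made for illustration. The weights are random and untrained, so the output is a well-formed probability but not a meaningful match score.

```python
import math
import random

# Hypothetical settings; the abstract specifies neither vocabulary nor sizes.
VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789 "
HIDDEN = 8


def init_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these by training."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def make_params(seed=0):
    """Build encoder and classifier weights from a fixed seed."""
    rng = random.Random(seed)
    w_in = init_matrix(HIDDEN, len(VOCAB), rng)    # character -> hidden
    w_rec = init_matrix(HIDDEN, HIDDEN, rng)       # hidden -> hidden
    w_hid = init_matrix(HIDDEN, 2 * HIDDEN, rng)   # classifier hidden layer
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)]
    return w_in, w_rec, w_hid, w_out


def encode(text, w_in, w_rec):
    """Run a plain recurrent cell over the characters of `text` and
    return the final hidden state as the context vector."""
    h = [0.0] * HIDDEN
    for ch in text.lower():
        idx = VOCAB.find(ch)
        if idx < 0:
            continue  # skip characters outside the toy vocabulary
        # A one-hot input selects a single column of w_in.
        h = [
            math.tanh(w_in[i][idx] + sum(w_rec[i][j] * h[j] for j in range(HIDDEN)))
            for i in range(HIDDEN)
        ]
    return h


def match_probability(a, b, params):
    """Estimate P(match | a, b) from the concatenated context vectors."""
    w_in, w_rec, w_hid, w_out = params
    # Assumption: both strings share the same encoder weights.
    context = encode(a, w_in, w_rec) + encode(b, w_in, w_rec)  # length 2 * HIDDEN
    hidden = [max(0.0, sum(w * x for w, x in zip(row, context))) for row in w_hid]
    logit = sum(w * x for w, x in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


params = make_params()
p = match_probability("Edinburgh", "Edinburg", params)
```

In a trained model, pairs labelled as duplicates or non-duplicates would drive the weight updates, and the single recurrent layer would typically be replaced by a deeper or gated architecture, which is the kind of architectural variation the abstract says the paper explores.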

Full Text