Abstract

We consider duplicate detection in the setting where handling typographical errors, toponym matching, and datatype dependency are combined into a single task. We cast this task as a string matching problem and solve it by estimating a conditional probability with an encoder-decoder model: each string is first encoded by a Deep Recurrent Network into a context vector, and the context vectors are then concatenated and used as input to a Deep Classifier Network. We explore how different architectures affect string matching performance when applied to duplicate detection. Finally, we test the models on numerous datasets of varying size, some of which focus more heavily on one of the datatype issues than others.
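The encode, concatenate, classify pipeline described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the character alphabet, the hidden size, the single-layer Elman recurrence standing in for a Deep Recurrent Network, the sharing of encoder weights between the two strings, the one-hidden-layer classifier, and the names `encode` and `match_probability`. The weights are random and untrained, so the output only shows the data flow, not a meaningful match score.

```python
import math
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "  # assumed character set
HIDDEN = 8  # assumed context-vector size


def init_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these by training."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def one_hot(ch):
    """One-hot encode a character; unknown characters map to the zero vector."""
    vec = [0.0] * len(ALPHABET)
    idx = ALPHABET.find(ch.lower())
    if idx >= 0:
        vec[idx] = 1.0
    return vec


def encode(text, w_in, w_rec):
    """Simple recurrent encoder: the final hidden state is the context vector."""
    hidden = [0.0] * HIDDEN
    for ch in text:
        from_input = matvec(w_in, one_hot(ch))
        from_state = matvec(w_rec, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
    return hidden


def match_probability(s1, s2, params):
    """Estimate P(match | s1, s2) from the concatenated context vectors."""
    # Assumption: both strings share the same encoder weights.
    context = (encode(s1, params["w_in"], params["w_rec"])
               + encode(s2, params["w_in"], params["w_rec"]))  # list concatenation
    layer = [max(0.0, v) for v in matvec(params["w_hid"], context)]  # ReLU
    logit = sum(w * v for w, v in zip(params["w_out"], layer))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


rng = random.Random(0)  # fixed seed so the sketch is reproducible
params = {
    "w_in": init_matrix(HIDDEN, len(ALPHABET), rng),
    "w_rec": init_matrix(HIDDEN, HIDDEN, rng),
    "w_hid": init_matrix(HIDDEN, 2 * HIDDEN, rng),
    "w_out": [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)],
}

p = match_probability("New York", "new yrok", params)
print(0.0 < p < 1.0)
```

In practice the parameters would be fitted on labeled pairs of duplicate and non-duplicate strings, and the recurrent and classifier parts could each be made deeper or swapped for other architectures, which is the kind of variation the abstract says is explored.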
