Abstract

Dictionary-based entity extraction is an important task in many data analysis applications, such as academic search, document classification, and code auto-debugging. To improve the effectiveness of extraction, many previous studies have focused on approximate dictionary-based entity extraction, which aims to find all substrings in documents that are similar to pre-defined entities in a reference entity dictionary. However, these studies consider only syntactic similarity metrics, such as Jaccard similarity and edit distance. In real-world scenarios, syntactically different strings often express the same meaning. Existing approximate entity extraction methods fail to identify this kind of semantic similarity and consequently suffer from low recall. In this paper, we introduce the new problem of approximate dictionary-based entity extraction with synonyms and propose an end-to-end framework, Aeetes, to solve it. We propose a new similarity measure, Asymmetric Rule-based Jaccard (JaccAR), that combines synonym rules with syntactic similarity metrics to capture the semantic similarity expressed by the synonyms. We devise and implement a filter-and-verification strategy to improve overall efficiency: we propose several pruning techniques to reduce the filtering cost and develop novel strategies to improve verification performance. Experimental results on three real-world datasets demonstrate the superior effectiveness and efficiency of Aeetes.
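To make the idea concrete, the following is a minimal Python sketch of synonym-aware approximate extraction. It is not the Aeetes implementation, and the abstract does not give the exact definition of JaccAR. The sketch assumes that synonym rules are applied only to the dictionary entity (one reading of "asymmetric"), that at most one rule is applied per derived variant, and that the score is the best Jaccard similarity over all variants. The function names (`derive_variants`, `synonym_jaccard`, `passes_size_filter`, `extract`), the rule format, and the example data are all hypothetical. The size check is a standard Jaccard length filter, used here only to illustrate the filter-and-verification pattern.

```python
def jaccard(a, b):
    """Plain Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_sub(tokens, sub):
    """Index of the first contiguous occurrence of sub in tokens, or -1."""
    n, m = len(tokens), len(sub)
    for i in range(n - m + 1):
        if tuple(tokens[i:i + m]) == tuple(sub):
            return i
    return -1


def derive_variants(entity, rules):
    """Original entity plus one variant per applicable rule (assumption:
    at most one rule per variant). rules is a list of (lhs, rhs) token tuples."""
    variants = [tuple(entity)]
    for lhs, rhs in rules:
        i = find_sub(entity, lhs)
        if i >= 0:
            variants.append(tuple(entity[:i]) + tuple(rhs) + tuple(entity[i + len(lhs):]))
    return variants


def synonym_jaccard(entity, substring, rules):
    """Best Jaccard score over the entity and its derived variants.
    Rules are applied to the entity side only (assumed asymmetry)."""
    return max(jaccard(v, substring) for v in derive_variants(entity, rules))


def passes_size_filter(a, b, threshold):
    """Filter step: Jaccard >= t requires min(|A|, |B|) >= t * max(|A|, |B|)."""
    small, large = sorted((len(set(a)), len(set(b))))
    return large > 0 and small >= threshold * large


def extract(document, dictionary, rules, threshold):
    """Return (entity, start, end, score) for token windows whose
    synonym-aware score reaches the threshold (threshold must be > 0)."""
    tokens = document.lower().split()
    matches = []
    for entity in dictionary:
        variants = derive_variants(entity.lower().split(), rules)
        # Simplification: only consider windows of up to
        # floor(longest variant length / threshold) tokens.
        max_window = int(max(len(v) for v in variants) / threshold)
        for start in range(len(tokens)):
            last = min(len(tokens), start + max_window)
            for end in range(start + 1, last + 1):
                window = tokens[start:end]
                best = 0.0
                for v in variants:
                    # Verify (compute Jaccard) only for variants that pass the filter.
                    if passes_size_filter(v, window, threshold):
                        best = max(best, jaccard(v, window))
                if best >= threshold:
                    matches.append((entity, start, end, best))
    return matches


# Hypothetical example data: one synonym rule and a one-entry dictionary.
rules = [(("new", "york", "university"), ("nyu",))]
dictionary = ["new york university"]
matches = extract("She studied at NYU in 2010", dictionary, rules, 0.8)
```

In this example, plain Jaccard between "nyu" and "new york university" is 0, so a purely syntactic matcher would miss the mention, whereas the synonym-aware score finds it. Enumerating every window, as done here, is deliberately naive; the pruning techniques the abstract refers to are what would replace that loop in a practical system.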
