Abstract
Real-time entity resolution (ER) is a challenging problem for large datasets. Traditional top-N join query processing techniques assume clean data and do not perform ER. On dirty datasets, where duplicate tuples refer to the same real-world entity, these techniques may return several top-N tuples for the same entity; useful tuples are then crowded out of the result, which leads to poor effectiveness. In this paper, under the assumptions of “sorted and/or random accesses” and “no wild guesses”, we discuss models that integrate real-time entity resolution with top-N join queries over dirty datasets of real vectors. For finite-dimensional \(\ell_{p}\) spaces with p-norm distances as nonmonotone ranking functions, we use the norm equivalence theorem from functional analysis as a foundation and design buffers that join tuples with an outer-join mechanism and cluster candidates for ER. On this basis we propose two database-friendly algorithms that answer top-N join queries under two data access settings: restricted sorted access and no random access. Extensive experiments measure the effectiveness and efficiency of our approaches over various dirty datasets.
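The effect of duplicates on a top-N result can be illustrated with a small sketch. This is not the paper's algorithm: it is a brute-force scan over a single relation, and it assumes the entity of each tuple is already known through a hypothetical `entity_of` mapping, whereas the paper resolves entities at query time. The names `p_norm_distance`, `top_n_naive`, and `top_n_with_er`, and the sample data, are all illustrative assumptions.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def p_norm_distance(x: Vector, y: Vector, p: float = 2.0) -> float:
    """p-norm (Minkowski) distance between two real vectors, for p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def top_n_naive(tuples: Dict[str, Vector], query: Vector, n: int,
                p: float = 2.0) -> List[str]:
    """Return the n tuple ids closest to the query, ignoring duplicates."""
    ranked = sorted(tuples, key=lambda tid: p_norm_distance(tuples[tid], query, p))
    return ranked[:n]


def top_n_with_er(tuples: Dict[str, Vector], entity_of: Dict[str, str],
                  query: Vector, n: int, p: float = 2.0) -> List[str]:
    """Return the closest tuple of each of the n nearest distinct entities.

    Assumption: entity_of is a precomputed tuple-id -> entity-id mapping,
    standing in for a real ER step.
    """
    ranked = sorted(tuples, key=lambda tid: p_norm_distance(tuples[tid], query, p))
    seen = set()
    result: List[str] = []
    for tid in ranked:
        entity = entity_of[tid]
        if entity in seen:
            continue  # a closer tuple of this entity is already in the result
        seen.add(entity)
        result.append(tid)
        if len(result) == n:
            break
    return result


# Hypothetical dirty dataset: t1 and t2 describe the same entity "A".
tuples = {"t1": (1.0, 1.0), "t2": (1.0, 1.1), "t3": (3.0, 3.0), "t4": (5.0, 5.0)}
entity_of = {"t1": "A", "t2": "A", "t3": "B", "t4": "C"}
query = (1.0, 1.0)

naive_result = top_n_naive(tuples, query, n=2)
er_result = top_n_with_er(tuples, entity_of, query, n=2)
```

With this data the naive top-2 consists of `t1` and `t2`, which are the same entity, so `t3` is never returned; the duplicate-aware version returns `t1` and `t3`.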