Abstract
This paper proposes an approach for entity resolution (ER) and conflict resolution (CR) in large-scale graphs. It is based on a class of Graph Cleaning Rules (GCRs), which support the primitives of relational data cleaning rules and may embed machine learning classifiers as predicates. As opposed to previous graph rules, GCRs are defined with a dual graph pattern to accommodate the irregular structures of schemaless graphs, and adopt star-shaped patterns to reduce complexity. We show that the satisfiability, implication and validation problems are all in polynomial time (PTIME) for GCRs, in contrast to the intractability of these classical problems for previous graph dependencies. We develop a parallel algorithm to discover GCRs by combining the generation of patterns and predicates, and a parallel PTIME algorithm for "deep" ER and CR that recursively applies the mined GCRs. We show that these algorithms are guaranteed to reduce runtime when more processors are used. Using real-life and synthetic graphs, we experimentally verify that rule discovery and error detection with GCRs are substantially faster than with previous graph dependencies, with improved accuracy.
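To make the idea of "deep" ER by recursive rule application concrete, the following is a minimal, sequential sketch of applying matching rules until a fixpoint is reached. It is not the paper's GCR syntax or its parallel algorithm. The graph encoding, the helper names (`UnionFind`, `resolve`, `same_company`, `same_person`) and the toy data are all assumptions made for illustration. Each rule is an ordinary callable, so in principle a learned classifier could stand in for the equality tests used here.

```python
from itertools import combinations


class UnionFind:
    """Tracks which nodes have been identified as the same entity."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def same_company(u, v, nodes, edges, uf):
    # Hypothetical rule: companies with equal names are the same entity.
    a, b = nodes[u], nodes[v]
    return a["type"] == b["type"] == "company" and a["name"] == b["name"]


def same_person(u, v, nodes, edges, uf):
    # Hypothetical rule: persons with equal names match only if they share
    # an employer that has ALREADY been resolved to one entity.
    a, b = nodes[u], nodes[v]
    if not (a["type"] == b["type"] == "person" and a["name"] == b["name"]):
        return False
    employers_u = {uf.find(n) for n in edges.get(u, ())}
    employers_v = {uf.find(n) for n in edges.get(v, ())}
    return bool(employers_u & employers_v)


def resolve(nodes, edges, rules):
    """Apply the rules to every node pair until no new match is found."""
    uf = UnionFind(nodes)
    changed = True
    while changed:
        changed = False
        for u, v in combinations(sorted(nodes), 2):
            if uf.find(u) == uf.find(v):
                continue
            if any(rule(u, v, nodes, edges, uf) for rule in rules):
                uf.union(u, v)
                changed = True
    return uf


# Toy graph (assumed data): three person nodes and three company nodes.
nodes = {
    "a1": {"type": "person", "name": "Ann Lee"},
    "a2": {"type": "person", "name": "Ann Lee"},
    "a3": {"type": "person", "name": "Ann Lee"},
    "c1": {"type": "company", "name": "Acme"},
    "c2": {"type": "company", "name": "Acme"},
    "c3": {"type": "company", "name": "Globex"},
}
edges = {"a1": {"c1"}, "a2": {"c2"}, "a3": {"c3"}}  # person -> employers

uf = resolve(nodes, edges, [same_company, same_person])
```

In this toy run, `a1` and `a2` cannot be matched in the first pass because their employers `c1` and `c2` are still distinct. Once the company rule merges `c1` and `c2`, a second pass matches the two persons, which is the recursive effect that "deep" ER refers to. `a3` stays separate because its employer differs. The loop terminates because each successful pass merges at least one pair and there are finitely many nodes.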