Abstract

This paper develops \(\mathsf{Hercules}\), a system for entity resolution (ER), conflict resolution (CR), timeliness deduction (TD) and missing value/link imputation (MI) in graphs. It proposes \(\mathsf{GCR^{+}\!s}\), a class of graph cleaning rules that support not only predicates for ER and CR, but also temporal orders to deduce timeliness and data extraction to impute missing data. Unlike previous graph rules, \(\mathsf{GCR^{+}\!s}\) are defined with a dual graph pattern to accommodate the irregular structures of schemaless graphs, and adopt star-shaped patterns to reduce complexity. We show that while the implication and satisfiability problems are intractable for \(\mathsf{GCR^{+}\!s}\), detecting and correcting errors with \(\mathsf{GCR^{+}\!s}\) is in PTIME. Underlying \(\mathsf{Hercules}\), we train a ranking model to predict temporal orders on attributes, and embed it as a predicate of \(\mathsf{GCR^{+}\!s}\). We provide an algorithm for discovering \(\mathsf{GCR^{+}\!s}\) that combines the generation of patterns with the generation of predicates. We also develop a method for conducting ER, CR, TD and MI in a single process to improve the overall quality of graphs, by leveraging their interactions and chasing with \(\mathsf{GCR^{+}\!s}\); we show that the method has the Church-Rosser property under certain conditions. Using real-life and synthetic graphs, we empirically verify that \(\mathsf{Hercules}\) is 53% more accurate than state-of-the-art graph cleaning systems, and performs comparably in efficiency and scalability.
