This paper develops \(\mathsf {Hercules} \), a system for entity resolution (ER), conflict resolution (CR), timeliness deduction (TD) and missing value/link imputation (MI) in graphs. It proposes \(\mathsf {GCR^{+}\!s} \), a class of graph cleaning rules that support not only predicates for ER and CR, but also temporal orders to deduce timeliness and data extraction to impute missing data. As opposed to previous graph rules, \(\mathsf {GCR^{+}\!s} \) are defined with a dual graph pattern to accommodate the irregular structures of schemaless graphs, and adopt star-shaped patterns to reduce complexity. We show that while the implication and satisfiability problems are intractable for \(\mathsf {GCR^{+}\!s} \), errors can be detected and corrected with \(\mathsf {GCR^{+}\!s} \) in PTIME. Within \(\mathsf {Hercules} \), we train a ranking model to predict temporal orders on attributes, and embed it as a predicate of \(\mathsf {GCR^{+}\!s} \). We provide an algorithm for discovering \(\mathsf {GCR^{+}\!s} \) that combines pattern generation with predicate generation. We also develop a method that conducts ER, CR, TD and MI in a single process to improve the overall quality of graphs, by leveraging their interactions and chasing with \(\mathsf {GCR^{+}\!s} \); we show that the method has the Church-Rosser property under certain conditions. Using real-life and synthetic graphs, we empirically verify that \(\mathsf {Hercules} \) is 53% more accurate than state-of-the-art graph cleaning systems, while performing comparably in efficiency and scalability.