Abstract

Because online encyclopedias are open and collaboratively edited, a large number of knowledge triples in these systems are improperly classified, so open-domain encyclopedia Knowledge Bases (KBs) must be denoised and refined to improve their quality and precision. However, missing and inaccurate semantic features of triples lead to poor refinement results. In addition, for large-scale encyclopedia KBs, processing massive amounts of knowledge incurs excessive computing time and limits the scalability of the algorithm. To solve the problems of knowledge denoising in the Chinese encyclopedia system, first, based on data field theory, this paper proposes a new Cartesian product mapping-based method (TripleES) to calculate the semantic similarity of entity triples, on which a method for quantifying the quality of entry tags is built. Second, to further improve the denoising effect on KBs, this paper proposes a new method (TriplePV) to compute the potential value of a triple; it uses a multi-feature fusion strategy to calculate the semantic distance between “out-of-vocabulary” entry tags and embeds that distance into the potential function. Third, to ensure that the algorithms scale well, the proposed denoising algorithms are implemented and optimized in parallel on the Spark cluster-computing framework. Specifically, Spark-based TripleES (ES_Spark) and Spark-based TriplePV (PV_Spark) algorithms are proposed to calculate the semantic similarity and the potential value of triples, respectively. Finally, the denoising effect and time efficiency are compared comprehensively with the state-of-the-art distributed Chinese encyclopedia knowledge denoising algorithm. The experimental results on real-world datasets show that the proposed parallel denoising algorithms improve both the efficiency of knowledge denoising and the accuracy of KBs, outperforming the state-of-the-art methods.

