A Data-Driven Approach for Extracting Representative Information From Large Datasets With Mixed Attributes

Feng Wu, Bin Jiang, Xin Huang

doi:10.1109/tem.2019.2934485

Abstract

The rapid growth of information technology and Internet applications has exposed users to an explosion of information. Mobile e-commerce applications and web search engines therefore have a strong interest in extracting representative information from this abundance. However, the items extracted by several existing methods, such as top-k, are often quite similar to one another, which makes it difficult to meet users’ demand for diversified information. To increase the diversity of representative information, this article proposes a data-driven approach that automatically identifies a subset of the original dataset covering more themes and content. The approach consists of two stages. First, a new unified similarity measure is proposed for handling datasets with both categorical and numeric attributes. We inject external knowledge and attribute interactions into the similarity learning process to improve the accuracy of similarity estimation between data objects. Second, we develop an enhanced density peaks clustering algorithm based on shared nearest neighbors to automatically identify representative objects according to the previously estimated similarity. The enhanced density peaks algorithm takes the local structure of the entire data space into consideration, which makes the proposed approach relatively insensitive to variations in the dataset’s density and dimensionality. Theoretical analysis shows that the time complexity of the proposed approach can be as low as $O(N \log N)$. Extensive comparison experiments were conducted on artificial and real-world datasets. The experimental results demonstrate the effectiveness and robustness of the proposed approach.
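To make the two-stage idea concrete, the following is a minimal sketch and not the method proposed in this article. It assumes a simple Gower-style dissimilarity in place of the learned unified similarity measure, and the classic cutoff-based density peaks scoring in place of the enhanced variant based on shared nearest neighbors. The names `gower_distance`, `representatives`, `numeric_idx`, and `cutoff` are hypothetical and chosen only for illustration. The sketch builds the full pairwise distance matrix, so it runs in quadratic time rather than the $O(N \log N)$ discussed above.

```python
from typing import Dict, List, Sequence, Set


def gower_distance(a: Sequence, b: Sequence,
                   numeric_idx: Set[int], ranges: Dict[int, float]) -> float:
    """Average per-attribute dissimilarity in [0, 1] (Gower-style, an assumption)."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_idx:
            # Numeric attribute: absolute difference scaled by the observed range.
            span = ranges[i]
            total += abs(x - y) / span if span > 0 else 0.0
        else:
            # Categorical attribute: simple mismatch indicator.
            total += 0.0 if x == y else 1.0
    return total / len(a)


def representatives(data: List[Sequence], numeric_idx: Set[int],
                    k: int, cutoff: float) -> List[int]:
    """Return indices of k objects with the highest density-peak score."""
    n = len(data)
    ranges = {i: max(r[i] for r in data) - min(r[i] for r in data)
              for i in numeric_idx}
    dist = [[gower_distance(data[i], data[j], numeric_idx, ranges)
             for j in range(n)] for i in range(n)]

    # Local density: number of other objects closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]

    # Separation: distance to the nearest object of higher density
    # (ties in density are broken in favour of the smaller index).
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n)
                  if (rho[j], -j) > (rho[i], -i)]
        delta.append(min(higher) if higher else max(dist[i]))

    # Objects that are both dense and far from denser objects score highest.
    order = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    return sorted(order[:k])


if __name__ == "__main__":
    # Toy records: (price, colour); the values are made up for illustration.
    toy = [(1.0, "red"), (1.1, "red"), (0.9, "red"),
           (9.0, "blue"), (9.2, "blue"), (8.8, "blue")]
    print(representatives(toy, numeric_idx={0}, k=2, cutoff=0.1))
```

On the toy records, the two selected objects come from different groups, which is the diversity effect the abstract describes: a purely score-ranked top-k selection could return two near-duplicates, whereas the separation term penalizes objects that sit next to an already dense neighbor.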

Full Text