Abstract
With advances in data collection technologies, multiple data sources are assuming increasing prominence in many applications. Clustering from multiple data sources has emerged as a topic of critical significance in the data mining and machine learning community. Different data sources provide different levels of necessarily detailed knowledge. Thus, combining multiple data sources is pivotal to facilitate the clustering process. However, in reality, the data usually exhibits heterogeneity and incompleteness. The key challenge is how to effectively integrate information from multiple heterogeneous sources in the presence of missing data. Conventional methods mainly focus on clustering heterogeneous data with full information in all sources or at least one source without missing values. In this paper, we propose a more general framework T-MIC (Tensor based Multi-source Incomplete data Clustering) to integrate multiple incomplete data sources. Specifically, we first use the kernel matrices to form an initial tensor across all the multiple sources. Then we formulate a joint tensor factorization process with the sparsity constraint and use it to iteratively push the initial tensor towards a quality-driven exploration of the latent factors by taking into account missing data uncertainty. Finally, these factors serve as features to clustering. Extensive experiments on both synthetic and real datasets demonstrate that our proposed approach can effectively boost clustering performance, even with large amounts of missing data.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.