Abstract

Preparing a dataset by merging multiple data sources through data fusion may lose vital information from each source dataset, along with some of the correlative information among the sources. Based on an extensive analysis of the unique characteristics of multi-source outliers, we propose multi-source outlier detection techniques that reliably identify outliers across multiple datasets. Drawing on several real-world examples, we classify multi-source outliers into three types (Type I-III) according to the correlation among datasets. We design a baseline algorithm, which is an intuitive solution, and an optimal algorithm, multiple-data-sources oriented outlier detection (MOD), to obtain high-score outliers. In addition, we build the MOD+ method to speed up the outlier detection process. A new density metric combining k-nearest neighbors (kNN) and reverse nearest neighbors (RNN) is introduced to evaluate the deviation degrees of multi-source outliers. The newly defined outliers are used to develop three outlier-join operators. MOD and MOD+ are adept at (1) mining outlier information from each of the multi-source datasets and (2) sensing correlative outlier information among these datasets. We implement and evaluate the three outlier detection algorithms on synthetic and real-world datasets. The experimental results demonstrate that the proposed methods are promising and practical for detecting outliers in multi-source datasets.
