Abstract
Background
In many cases biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery. Data analysis without removing outliers can lead to wrong results and provide misleading information.

Results
We propose a new outlier detection method based on Kullback-Leibler (KL) divergence. The original concept of KL divergence was designed as a measure of the distance between two distributions. Stemming from that, we extend it to biological sample outlier detection by forming sample sets composed of nearest neighbors; the KL divergence is defined between the two sample sets with and without the test sample. To handle the non-linearity of the sample distribution, the original data are mapped into a higher-dimensional feature space. We address the singularity problem caused by the small sample size during KL divergence calculation, and kernel functions are applied to avoid direct use of the mapping functions. The performance of the proposed method is demonstrated on a synthetic data set, two public microarray data sets, and a mass spectrometry data set for a liver cancer study. Comparative studies with a Mahalanobis distance based method and a one-class support vector machine (SVM) show that the proposed method performs better in finding outliers.

Conclusion
Our idea was derived from the Markov blanket algorithm, a feature selection method based on KL divergence: while the Markov blanket algorithm removes redundant and irrelevant features, our proposed method detects outliers. Compared to other algorithms, the proposed method shows better or comparable performance for small-sample, high-dimensional biological data. This indicates that the proposed method can be used to detect outliers in biological data sets.
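To make the core computation concrete, the following is a minimal sketch of a KL-divergence outlier score, assuming Gaussian density estimates in the original input space (the paper's kernelized feature-space formulation is not reproduced here). The function names, the neighborhood size k, and the ridge regularizer used to sidestep the small-sample singularity are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """Closed-form KL divergence KL(N0 || N1) between two Gaussians."""
    d = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)   # slogdet avoids det underflow
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + logdet1 - logdet0)

def klod_score(X, i, k=10, ridge=1e-3):
    """Outlier score for sample i: KL divergence between the Gaussian
    fitted to its k nearest neighbors with and without sample i.
    `ridge` regularizes the covariance, a stand-in for the paper's
    handling of the singularity that arises when k is small relative
    to the dimension."""
    dists = np.linalg.norm(X - X[i], axis=1)
    order = np.argsort(dists)
    neighbors = order[1:k + 1]            # k nearest neighbors, excluding i
    S_without = X[neighbors]
    S_with = X[np.append(neighbors, i)]

    def fit(S):
        mu = S.mean(axis=0)
        cov = np.cov(S, rowvar=False) + ridge * np.eye(X.shape[1])
        return mu, cov

    return gaussian_kl(*fit(S_with), *fit(S_without))
```

Samples whose score exceeds a chosen threshold t (the highlights below report experiments with t = 10) would be flagged as outliers.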
Highlights
In many cases biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery
The performance of KL divergence for outlier detection (KLOD) was compared with one-class support vector machine (SVM) and Mahalanobis distance based outlier detection methods
In experiments with the two microarray data sets, specificity, sensitivity, and accuracy were measured using a principal component analysis (PCA) + linear discriminant analysis (LDA) classification strategy after removing the outliers detected by KLOD with t = 10, by the Mahalanobis distance based method, and by one-class SVM (a sketch of such an evaluation pipeline follows this list)
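As a hedged illustration of that evaluation protocol, the sketch below wires a PCA + LDA classifier into a cross-validated run on synthetic stand-in data; the data generator, the outlier indices, and the number of retained components are placeholders, not values from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for a small-sample, high-dimensional expression matrix.
X, y = make_classification(n_samples=60, n_features=500,
                           n_informative=20, random_state=0)
outlier_idx = [3, 17]                     # hypothetical indices flagged by KLOD

# Drop the flagged samples before classification.
mask = np.ones(len(y), dtype=bool)
mask[outlier_idx] = False
X_clean, y_clean = X[mask], y[mask]

# PCA + LDA classification strategy, evaluated by cross-validation.
clf = make_pipeline(PCA(n_components=5), LinearDiscriminantAnalysis())
y_pred = cross_val_predict(clf, X_clean, y_clean, cv=5)

tn, fp, fn, tp = confusion_matrix(y_clean, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(sensitivity, specificity, accuracy)
```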
Summary
In many cases biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery. It is therefore imperative to remove outliers during preprocessing, before the analysis, to prevent wrong results. Several methods have been proposed to detect such outliers. One approach builds on support vector clustering (SVC): a validity measure finds suitable values for the kernel parameter and the soft margin constant, and with these parameters the SVC algorithm can identify the ideal number of clusters and become more robust to outliers and noise. Manevitz and Yousef presented two versions of the one-class SVM, both of which can identify outliers: Schölkopf's method and their own suggestion [10]. In such methods, the original samples are mapped into a feature space using an appropriate kernel function, and the origin is treated as the second class. Another criterion, leave-one-out error (LOOE) sensitivity, derives from the observation that if a sample is mislabeled, flipping its label should improve the prediction power.
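To show how the one-class SVM variant flags outliers in practice, here is a small runnable sketch using scikit-learn's OneClassSVM, which implements Schölkopf's formulation; the sample matrix and the injected outlier are synthetic assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))           # hypothetical sample matrix
X[0] += 6.0                             # inject one artificial outlier

# Schölkopf's one-class SVM: samples are mapped into a kernel feature
# space and separated from the origin, which acts as the second class.
oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X)
labels = oc_svm.predict(X)              # +1 for inliers, -1 for outliers
print(np.where(labels == -1)[0])
```

Here nu upper-bounds the fraction of training samples treated as outliers, playing a role loosely analogous to the threshold t in KLOD.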