Abstract

Background: In many cases, biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery. Data analysis without removing outliers could lead to wrong results and provide misleading information.

Results: We propose a new outlier detection method based on Kullback-Leibler (KL) divergence. The original concept of KL divergence was designed as a measure of the distance between two distributions. Stemming from that, we extend it to biological sample outlier detection by forming sample sets composed of nearest neighbors; KL divergence is then defined between the two sample sets with and without the test sample. To handle the non-linearity of the sample distribution, the original data is mapped into a higher-dimensional feature space. We address the singularity problem that arises from the small sample size during the KL divergence calculation, and kernel functions are applied to avoid direct use of the mapping functions. The performance of the proposed method is demonstrated on a synthetic data set, two public microarray data sets, and a mass spectrometry data set from a liver cancer study. Comparative studies with a Mahalanobis distance-based method and a one-class support vector machine (SVM) show that the proposed method performs better in finding outliers.

Conclusion: Our idea was derived from the Markov blanket algorithm, a feature selection method based on KL divergence. That is, while the Markov blanket algorithm removes redundant and irrelevant features, our proposed method detects outliers. Compared to other algorithms, the proposed method shows better or comparable performance for small-sample, high-dimensional biological data. This indicates that the proposed method can be used to detect outliers in biological data sets.
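To make the neighbor-set idea concrete, the following is a minimal Python sketch of the scoring step, assuming Gaussian estimates in the original input space. It omits the paper's kernel mapping and handles the small-sample singularity with a simple ridge term, so the names and parameters here (gaussian_kl, klod_score, k, reg) are illustrative assumptions rather than the published implementation.

    import numpy as np

    def gaussian_kl(mu_p, cov_p, mu_q, cov_q, reg=1e-6):
        # Closed-form KL divergence between two multivariate Gaussians.
        # The ridge `reg` keeps the covariances invertible when the
        # number of neighbors is small (a simplifying assumption; the
        # paper addresses this singularity in feature space instead).
        d = len(mu_p)
        cov_p = cov_p + reg * np.eye(d)
        cov_q = cov_q + reg * np.eye(d)
        inv_q = np.linalg.inv(cov_q)
        diff = mu_q - mu_p
        _, logdet_p = np.linalg.slogdet(cov_p)
        _, logdet_q = np.linalg.slogdet(cov_q)
        return 0.5 * (np.trace(inv_q @ cov_p) + diff @ inv_q @ diff
                      - d + logdet_q - logdet_p)

    def klod_score(X, i, k=10):
        # Score sample i by the KL divergence between its k nearest
        # neighbors with and without the sample itself.
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]   # skip the sample itself
        with_i = X[np.append(neighbors, i)]
        without_i = X[neighbors]
        return gaussian_kl(with_i.mean(0), np.cov(with_i, rowvar=False),
                           without_i.mean(0), np.cov(without_i, rowvar=False))

    # Samples with large scores are flagged as outliers.
    X = np.vstack([np.random.randn(50, 5), [[8.0] * 5]])
    scores = [klod_score(X, i) for i in range(len(X))]
    print(int(np.argmax(scores)))  # expected: 50, the injected outlier

An outlier shifts the mean and inflates the covariance of its neighbor set, so the divergence between the two Gaussian estimates grows sharply for atypical samples.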

Highlights

  • In many cases, biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery

  • The performance of KL divergence for outlier detection (KLOD) was compared with one-class support vector machine (SVM) and Mahalanobis distance-based outlier detection methods

  • In experiments with the two microarray data sets, specificity, sensitivity, and accuracy were measured using a principal component analysis (PCA) + linear discriminant analysis (LDA) classification strategy after removing the outliers detected by KLOD with t = 10, the Mahalanobis distance-based method, and the one-class SVM (a minimal evaluation sketch follows this list)
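As referenced in the last highlight, the following is a minimal scikit-learn sketch of a PCA + LDA evaluation strategy. The data here is a random placeholder standing in for a post-outlier-removal microarray set, and the 5-fold split is an assumption, since the paper's exact protocol is not reproduced in this summary.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import cross_val_predict
    from sklearn.pipeline import make_pipeline

    # Placeholder for data after removing detected outliers:
    # small sample size, high dimensionality, as in microarray studies.
    rng = np.random.default_rng(1)
    X_clean = rng.normal(size=(80, 200))
    y_clean = rng.integers(0, 2, size=80)

    clf = make_pipeline(PCA(n_components=10), LinearDiscriminantAnalysis())
    y_pred = cross_val_predict(clf, X_clean, y_clean, cv=5)
    tn, fp, fn, tp = confusion_matrix(y_clean, y_pred).ravel()
    print("sensitivity", tp / (tp + fn))
    print("specificity", tn / (tn + fp))
    print("accuracy", (tp + tn) / len(y_clean))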



Introduction

In many cases, biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery. Prior to the analysis, it is imperative to remove outliers during preprocessing to prevent wrong results. Several approaches have been proposed to detect such outliers. In support vector clustering (SVC), a cluster validity measure is capable of finding suitable values for the kernel parameter and the soft-margin constant; based on these parameters, the SVC algorithm can identify the ideal number of clusters and increase its robustness to outliers and noise. Manevitz and Yousef presented two versions of the one-class SVM, both of which can identify outliers: Schölkopf's method and their own proposed variant [10]. In such methods, after mapping the original samples into a feature space using an appropriate kernel function, the origin is treated as the second class. LOOE (leave-one-out error) sensitivity was derived from the fact that if a sample is mislabeled, flipping the label of that sample should improve the prediction power.
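For comparison with the baseline above, a one-class SVM outlier detector is available in standard libraries; the following minimal sketch uses scikit-learn's OneClassSVM, where the RBF kernel plays the role of the feature-space mapping and the nu and gamma values are illustrative choices, not the paper's settings.

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(size=(60, 4)), [[6.0] * 4]])

    # nu upper-bounds the fraction of training samples treated as
    # outliers; with the RBF kernel, samples are mapped into a feature
    # space in which the origin acts as the second class.
    ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X)
    labels = ocsvm.predict(X)            # +1 inlier, -1 outlier
    print(np.where(labels == -1)[0])     # expected to include index 60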

