Abstract

Background: Large biological data sets, such as expression profiles, benefit from reduction of random noise. Principal component (PC) analysis has been used for this purpose, but it tends to remove small features as well as random noise.

Results: We interpreted the PCs as a mere signal-rich coordinate system and sorted the squared PC-coordinates of each row in descending order. The sorted squared PC-coordinates were compared with the distribution of the ordered squared random noise, and PC-coordinates for insignificant contributions were treated as random noise and nullified. The processed data were transformed back to the initial coordinates as noise-reduced data. To increase the sensitivity of signal capture and reduce the effects of stochastic noise, this procedure was applied to multiple small subsets of rows randomly sampled from a large data set, and the results corresponding to each row of the data set from multiple subsets were averaged. We call this procedure Row-specific, Sorted PRincipal component-guided Noise Reduction (RSPR-NR). Robust performance of RSPR-NR, measured by noise reduction and retention of small features, was demonstrated using simulated data sets. Furthermore, when applied to an actual expression profile data set, RSPR-NR preferentially increased the correlations between genes that share the same Gene Ontology terms, strongly suggesting reduction of random noise in the data set.

Conclusion: RSPR-NR is a robust random noise reduction method that retains small features well. It should be useful in improving the quality of large biological data sets.
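The procedure described under Results can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the noise standard deviation `sigma` is assumed to be known, the function names and parameters (`sequential_thresholds`, `denoise_subset`, `rspr_nr`, `subset_size`, `n_subsets`, `q`) are hypothetical, and the significance rule is an assumed stand-in, since the abstract does not specify the actual comparison against the ordered squared noise distribution. Here each sorted squared PC-coordinate is compared with the q-quantile of the largest of the remaining squared noise values, and everything after the first non-significant value is nullified.

```python
from statistics import NormalDist

import numpy as np


def sequential_thresholds(n, sigma, q=0.95):
    """q-quantile of the largest of k squared N(0, sigma^2) values, for k = n, n-1, ..., 1.

    Assumed significance rule (not taken from the paper).
    """
    k = np.arange(n, 0, -1)
    p = q ** (1.0 / k)  # P(single squared value < t) needed so that P(max of k < t) = q
    z = np.array([NormalDist().inv_cdf((1.0 + float(pi)) / 2.0) for pi in p])
    return (sigma * z) ** 2


def denoise_subset(D_sub, thresholds):
    """Row-specific nullification of insignificant PC-coordinates for one subset of rows."""
    # PCs of the subset via SVD; rows of Vt are the PC axes (no centering: zero-mean noise assumed).
    _, _, Vt = np.linalg.svd(D_sub, full_matrices=False)
    C = D_sub @ Vt.T                      # PC-coordinates of each row
    sq = C ** 2
    order = np.argsort(-sq, axis=1)       # per-row descending order of squared coordinates
    sq_sorted = np.take_along_axis(sq, order, axis=1)
    # Keep the leading run of significant values in each row; nullify the rest.
    keep_sorted = np.cumprod(sq_sorted > thresholds, axis=1).astype(bool)
    keep = np.zeros_like(keep_sorted)
    np.put_along_axis(keep, order, keep_sorted, axis=1)
    return (C * keep) @ Vt                # back to the initial coordinates


def rspr_nr(D, sigma, subset_size, n_subsets, q=0.95, seed=0):
    """Average the row-specific results over random row subsets."""
    m, n = D.shape
    if subset_size < n:
        raise ValueError("subset_size must be at least the number of columns")
    rng = np.random.default_rng(seed)
    thresholds = sequential_thresholds(n, sigma, q)
    total = np.zeros((m, n))
    counts = np.zeros(m)
    for _ in range(n_subsets):
        rows = rng.choice(m, size=subset_size, replace=False)
        total[rows] += denoise_subset(D[rows], thresholds)
        counts[rows] += 1
    out = D.astype(float)                 # rows never sampled are left unchanged
    seen = counts > 0
    out[seen] = total[seen] / counts[seen, None]
    return out
```

Because each row keeps its own set of PC-coordinates, a feature confined to a few rows can survive even when the PCs that carry it explain little of the overall variance. Averaging over subsets is intended to smooth out the dependence on any one random sample of rows.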

Highlights

  • Large biological data sets, such as expression profiles, benefit from reduction of random noise

  • RSPR-NR algorithm: this noise reduction procedure can be applied to a data matrix D with m rows × n columns (m > n), in which the random noise is assumed to be normally distributed with a mean of zero

  • We could determine the optimum number of top principal components (PCs) for principal component analysis (PCA) in the analyses shown in Figures 3 and 5 because we knew the exact signal in the simulation
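The last highlight refers to choosing the number of top PCs when the true signal is known. A minimal sketch of that idea follows, assuming the choice is made by minimizing the mean squared error against the known signal; the error criterion and the function names `pca_truncate` and `best_num_pcs` are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def pca_truncate(D, k):
    """Conventional PCA noise reduction: keep the top k PCs and nullify the rest."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]


def best_num_pcs(noisy, signal):
    """Return the k that minimizes the mean squared error against the known signal.

    Only possible in simulations, where the exact signal is available.
    """
    max_k = min(noisy.shape)
    errors = [float(np.mean((pca_truncate(noisy, k) - signal) ** 2))
              for k in range(max_k + 1)]
    return int(np.argmin(errors)), errors
```

With real data the exact signal is unknown, so this selection is not available and the number of retained PCs has to be chosen by some other criterion.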

Introduction

Large biological data sets, such as expression profiles, benefit from reduction of random noise. It may be possible to statistically identify and reduce such random noise, especially in large data sets. Principal component analysis (PCA), which is closely related to singular value decomposition, has been used for statistical reduction of random noise [1]: small variances associated with higher-order principal components (PCs) are nullified as random noise. PCA has also been used for analysis of large biological data sets, such as expression profile data [1,2]. A typical form of expression profile data is a matrix with thousands of rows (genes) and tens or hundreds of columns (biological samples). Small features consisting of small numbers of rows and columns do not contribute much to the overall variance of
