Finite-size effects in transcript sequencing count distribution: its power-law correction necessarily precedes downstream normalization and comparative analysis

Wing-Cheong Wong, Hong-Kiat Ng, Richie Soong, Erwin Tantoso, Frank Eisenhaber

doi:10.1186/s13062-018-0204-y

Abstract

Background: Although earlier work on modelling transcript abundance, from vertebrates to lower eukaryotes, has specifically singled out Zipf’s law, the observed distributions often deviate from a single power-law slope. Power laws of critical phenomena are derived asymptotically under the condition of infinite observations, whereas real-world observations are finite. Finite-size effects therefore set in and force a power-law distribution into an exponential decay, which manifests as curvature (i.e., varying exponent values) in a log-log plot. If transcript abundance is truly power-law distributed, the varying exponent signifies changing mathematical moments (e.g., mean, variance) and creates heteroskedasticity, which compromises statistical rigor in analysis. The impact of this deviation from the asymptotic power law on sequencing count data has never truly been examined and quantified.

Results: The anecdotal description of transcript abundance as almost Zipf’s law-like distributed can be conceptualized as the imperfect rendition of the Pareto power-law distribution when it is subjected to finite-size effects in the real world. This holds regardless of advances in sequencing technology, since sampling is finite in practice. Our conceptualization agrees well with our empirical analysis of two modern NGS (next-generation sequencing) datasets: an in-house dilution miRNA study of two gastric cancer cell lines (NUGC3 and AGS) and a publicly available spike-in miRNA dataset. First, finite-size effects cause sequencing count data to deviate from Zipf’s law and give rise to reproducibility issues in sequencing experiments. Second, they manifest as heteroskedasticity among experimental replicates, which undermines statistical analysis.

Surprisingly, a straightforward power-law correction that restores the distorted distribution to a single exponent value can dramatically reduce data heteroskedasticity, yielding an immediate increase in signal-to-noise ratio by 50% and in statistical/detection sensitivity by as much as 30%, regardless of the downstream mapping and normalization methods. Most importantly, the power-law correction improves concordance in significant calls among different normalization methods of a data series by 22% on average. When presented with a higher sequencing depth (a 4-fold difference), the improvement in concordance is asymmetrical (32% for the higher-depth instance versus 13% for the lower-depth instance), which demonstrates that the simple power-law correction can increase significant detection at higher sequencing depths. Finally, the correction dramatically enhances the statistical conclusions and elucidates the metastatic potential of the NUGC3 cell line relative to AGS in our dilution analysis.

Conclusions: Finite-size effects due to undersampling generally plague transcript count data with reproducibility issues, but they can be minimized through a simple power-law correction of the count distribution. This distribution correction has direct implications for the biological interpretation of a study and the rigor of its scientific findings.

Reviewers: This article was reviewed by Oliviero Carugo, Thomas Dandekar and Sandor Pongor.

Highlights

Although earlier work on modelling transcript abundance, from vertebrates to lower eukaryotes, has singled out Zipf’s law, the observed distributions often deviate from a single power-law slope.
Finite-size effects introduce curvature in sequencing count distributions, impair the reproducibility of the experiment and bring about heteroskedasticity among experimental replicates.
Two miRNA sequencing datasets composed of technical replicates were examined. The choice of miRNA is deliberate, to avoid both transcript length bias [9] and abundance quantification [21] as confounding factors.
The varying-concentration design simulates different sequencing depths, mimicking a system of various sizes so that its finite-size effects can be studied (see Additional file 1: Figure S1).
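
The idea of treating sequencing depth as system size can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' pipeline and not their power-law correction: it assumes an ideal Zipf population with exponent 1, multinomial read sampling, and illustrative sizes (2,000 transcripts; libraries of 2,000 and 2,000,000 reads). All function names are hypothetical.

```python
import numpy as np


def zipf_probabilities(n_transcripts, exponent=1.0):
    """Ideal rank-abundance probabilities, p(r) proportional to r**(-exponent)."""
    ranks = np.arange(1, n_transcripts + 1, dtype=float)
    weights = ranks ** (-exponent)
    return weights / weights.sum()


def sequence(probabilities, depth, rng):
    """Draw one library of `depth` reads (multinomial sampling is an assumption)."""
    return rng.multinomial(depth, probabilities)


def fitted_slope(counts, first_rank, last_rank):
    """Least-squares slope of log(count) versus log(rank) over a rank window.

    Ranks are reassigned from the observed counts; zero counts are skipped
    because they cannot be placed on a log axis.
    """
    ordered = np.sort(counts)[::-1]
    ranks = np.arange(1, ordered.size + 1)
    keep = (ranks >= first_rank) & (ranks <= last_rank) & (ordered > 0)
    slope, _intercept = np.polyfit(np.log(ranks[keep]), np.log(ordered[keep]), 1)
    return slope


rng = np.random.default_rng(0)
probs = zipf_probabilities(n_transcripts=2000)

# Same underlying population, two hypothetical sequencing depths.
for depth in (2_000, 2_000_000):
    counts = sequence(probs, depth, rng)
    detected = int((counts > 0).sum())
    head = fitted_slope(counts, 1, 50)
    tail = fitted_slope(counts, 200, 2000)
    print(f"depth={depth:>9,d}  detected={detected:4d}  "
          f"head slope={head:+.2f}  tail slope={tail:+.2f}")
```

Under these assumptions, the shallow library should detect only a fraction of the transcripts and its fitted tail slope should move away from −1, while the deep library should stay close to the ideal slope. The rank windows (1 to 50 and 200 to 2000) are arbitrary choices made for illustration.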

Summary

Introduction

Although earlier work on modelling transcript abundance, from vertebrates to lower eukaryotes, has singled out Zipf’s law, the observed distributions often deviate from a single power-law slope. If transcript abundance is truly power-law distributed, the varying exponent signifies changing mathematical moments (e.g., mean, variance) and creates heteroskedasticity, which compromises statistical rigor in analysis. The impact of this deviation from the asymptotic power law on sequencing count data has never truly been examined and quantified. Earlier work on modelling SAGE-derived (serial analysis of gene expression) transcript abundance, from vertebrates to lower eukaryotes, has singled out the power-law distribution, namely Zipf’s law [3,4,5,6,7], where the slope of the power-law equation is about −1 on a log-log scale. There is a caveat to the power-law association: the observed power-law distribution of transcript abundance is usually imperfect in that it deviates from a single parameterized power-law slope.
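
The slope statement can be written out explicitly. The first line below is the standard rank-abundance form of Zipf's law. The second line is an assumed, generic exponential-cutoff form that is often used to describe finite-size effects. The cutoff form and the cutoff rank r_c are illustrative assumptions and are not taken from the source.

```latex
% Ideal Zipf-type law: a straight line of slope -\alpha on log-log axes
\begin{align*}
f(r) &= C\, r^{-\alpha}, &
\log f(r) &= \log C - \alpha \log r, \qquad \alpha \approx 1 \\
% Assumed finite-size form (illustrative): exponential cutoff at rank r_c
f(r) &= C\, r^{-\alpha} e^{-r/r_c}, &
\frac{\mathrm{d}\log f}{\mathrm{d}\log r} &= -\alpha - \frac{r}{r_c}
\end{align*}
```

With the cutoff term, the local log-log slope is no longer constant: it is close to −α for ranks well below r_c and becomes steeper as r approaches r_c, which is one way to picture a curved log-log plot with a varying exponent.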

Journal: Biology Direct	Publication Date: Feb 12, 2018
Citations: 1	License type: open-access
