Abstract

PURPOSE: Methods for depth normalization have been assessed primarily with simulated data or cell-line–mixture data. There is a pressing need for benchmark data enabling a more realistic and objective assessment, especially in the context of small RNA sequencing.

METHODS: We collected a unique pair of microRNA sequencing data sets for the same set of tumor samples; one data set was collected with and the other without uniform handling and balanced design. The former provided a benchmark for evaluating evidence of differential expression and the latter served as a test bed for normalization. Next, we developed a data perturbation algorithm to simulate additional data set pairs. Last, we assembled a set of computational tools to visualize and quantify the assessment.

RESULTS: We validated the quality of the benchmark data and showed the need for normalization of the test data. For illustration, we applied the data and tools to assess the performance of 9 existing normalization methods. Among them, trimmed mean of M-values was a better scaling method, whereas the median and the upper quartiles were consistently the worst performers; one variation of remove unwanted variation had the best chance of capturing true positives but at the cost of increased false positives. In general, these methods were, at best, moderately helpful when the level of differential expression was extensive and asymmetric.

CONCLUSION: Our study (1) provides the much-needed benchmark data and computational tools for assessing depth normalization, (2) shows the dependence of normalization performance on the underlying pattern of differential expression, and (3) calls for continued research efforts to develop more effective normalization methods.

Introduction

Several analytic methods have been proposed for normalizing sequencing depth. Earlier methods were based mostly on the scaling strategy, which calculates a scaling factor (eg, the total number of counts) for each sample to adjust the data.[1,2,3] Later, more involved methods based on regression (eg, with regard to selected principal components of all or some markers) were proposed on the basis of empirical observations that depth does not influence sequencing data in a simple overall shifting manner and concerns that it can be complicated by other nonspecific sources of handling variations.[4,5,6] Many of these methods were developed in the context of differential expression analysis, and their performance has been assessed mostly using parametrically simulated data and/or cell-line–mixture data that may not realistically reflect the distributional characteristics of sequencing data.[1,2,4]
We set out to develop the data and analytics to enable a more realistic and objective assessment of depth normalization methods, focusing on a class of small RNAs called microRNAs (miRNAs). MiRNAs are 18 to 22 nucleotides long, which minimizes the potential bias in abundance estimation due to gene length variation, as seen in RNAs.[7,8] They play an important regulatory role in gene expression in the cell and are closely linked to cell apoptosis and carcinogenesis.[9,10]
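
The scaling strategy described above can be illustrated with a minimal sketch. It is not the implementation used in the study. The function names, the list-of-lists data layout, the restriction to nonzero counts for the median and upper-quartile factors, and the rescaling of factors to an average of 1 are all assumptions made for illustration.

```python
from statistics import median, quantiles


def scaling_factors(counts, method="total"):
    """Return one scaling factor per sample.

    counts: list of samples, each a list of raw read counts per miRNA
            (hypothetical layout chosen for this sketch).
    method: "total", "median", or "upper_quartile".
    """
    if method == "total":
        raw = [sum(sample) for sample in counts]
    elif method == "median":
        # Assumption: the median is taken over nonzero counts only.
        raw = [median([c for c in sample if c > 0]) for sample in counts]
    elif method == "upper_quartile":
        # Assumption: 75th percentile of nonzero counts, inclusive method.
        raw = [
            quantiles([c for c in sample if c > 0], n=4, method="inclusive")[2]
            for sample in counts
        ]
    else:
        raise ValueError(f"unknown method: {method}")

    # Rescale so the factors average to 1, keeping normalized counts
    # on roughly the same scale as the raw counts.
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]


def normalize(counts, method="total"):
    """Divide each sample's counts by that sample's scaling factor."""
    factors = scaling_factors(counts, method)
    return [[c / f for c in sample] for sample, f in zip(counts, factors)]


# Toy example: the second sample was sequenced at twice the depth.
toy_counts = [[10, 20, 30, 40], [20, 40, 60, 80]]
normalized = normalize(toy_counts, method="total")
```

In this toy example the two samples differ only by a constant factor, so any of the three scaling factors brings them to the same values. The regression-based methods mentioned above were proposed because, by the empirical observations cited there, depth does not affect the data in such a simple overall shift.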

