Abstract

Single-cell RNA-seq (scRNA-seq) data simulation is critical for evaluating computational methods for analysing scRNA-seq data, especially when ground truth is experimentally unattainable. The reliability of evaluation depends on the ability of simulation methods to capture properties of experimental data. However, while many scRNA-seq data simulation methods have been proposed, a systematic evaluation of these methods is lacking. We develop a comprehensive evaluation framework, SimBench, including a kernel density estimation measure, to benchmark 12 simulation methods on 35 scRNA-seq experimental datasets. We evaluate the simulation methods on a panel of data properties, their ability to maintain biological signals, their scalability and their applicability. Our benchmark uncovers performance differences among the methods and highlights the varying difficulty of simulating different data characteristics. Furthermore, we identify several limitations, including maintaining heterogeneity of distribution. These results, together with the framework and datasets made publicly available as R packages, will guide the selection of simulation methods and their future development.
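
To give a rough sense of what a kernel density based comparison between experimental and simulated data can look like, the sketch below estimates a Gaussian kernel density for one data property (for example, per-gene mean expression) in each dataset and integrates the squared difference between the two densities. This is a generic illustration and not the statistic used by SimBench: the function names `gaussian_kde` and `kde_discrepancy`, the normal-reference bandwidth rule and the grid integration are all assumptions made for the example.

```python
import math
import random
import statistics


def gaussian_kde(sample):
    """Return a Gaussian kernel density estimate as a callable.

    Assumption: bandwidth from the normal-reference rule of thumb,
    h = 1.06 * sd * n^(-1/5). The sample must not be constant.
    """
    n = len(sample)
    bandwidth = 1.06 * statistics.stdev(sample) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


def kde_discrepancy(real, simulated, grid_size=200):
    """Approximate the integrated squared difference between two KDEs.

    Hypothetical score for illustration: 0 means identical estimated
    densities, larger values mean the distributions differ more.
    """
    lo = min(min(real), min(simulated))
    hi = max(max(real), max(simulated))
    pad = 0.1 * (hi - lo)  # extend the grid slightly beyond the data range
    lo, hi = lo - pad, hi + pad
    step = (hi - lo) / (grid_size - 1)
    f, g = gaussian_kde(real), gaussian_kde(simulated)
    return step * sum(
        (f(lo + i * step) - g(lo + i * step)) ** 2 for i in range(grid_size)
    )


# Toy data standing in for one property of a real and two simulated datasets.
rng = random.Random(0)
real_prop = [rng.gauss(0.0, 1.0) for _ in range(300)]
sim_close = [rng.gauss(0.0, 1.0) for _ in range(300)]
sim_shifted = [rng.gauss(2.0, 1.0) for _ in range(300)]

score_close = kde_discrepancy(real_prop, sim_close)
score_shifted = kde_discrepancy(real_prop, sim_shifted)
print(score_close, score_shifted)
```

A simulation that reproduces the property well should yield a score near zero, while a shifted or distorted distribution yields a larger score. A real benchmark would also need to handle multivariate properties and calibrate the score, which this sketch does not attempt.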

Highlights

  • Single-cell RNA-seq (scRNA-seq) data simulation is critical for evaluating computational methods for analysing scRNA-seq data, especially when ground truth is experimentally unattainable

  • The zero-inflated negative binomial (ZINB) model accounts for excess zeros in count data and has been chosen by other studies to better model the sparsity of single-cell data[7,8]

  • We presented a comprehensive benchmark study assessing the performance of 12 single-cell simulation methods using 35 datasets and a total of 25 criteria across four aspects of interest
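
The ZINB model mentioned in the highlights can be illustrated with a minimal sampling sketch. It assumes the common mean/dispersion parameterisation, in which a count is a structural zero with probability `pi_zero` and is otherwise drawn from a negative binomial with mean `mu` and variance `mu + dispersion * mu^2`, generated here as a gamma-Poisson mixture. The function names and parameter values are hypothetical and are not taken from any of the benchmarked simulation methods.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method.

    Adequate for small rates only; exp(-lam) underflows for very large lam.
    """
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_zinb(mu, dispersion, pi_zero, rng):
    """Draw one count from a zero-inflated negative binomial.

    Assumed parameterisation: NB mean `mu`, variance mu + dispersion * mu**2,
    plus an extra point mass at zero with probability `pi_zero`.
    """
    if rng.random() < pi_zero:
        return 0  # structural (excess) zero
    # Gamma-Poisson mixture: shape 1/dispersion, scale mu*dispersion => mean mu
    lam = rng.gammavariate(1.0 / dispersion, mu * dispersion)
    return sample_poisson(lam, rng)


rng = random.Random(42)
counts = [sample_zinb(mu=5.0, dispersion=0.5, pi_zero=0.3, rng=rng)
          for _ in range(20000)]

zero_fraction = sum(c == 0 for c in counts) / len(counts)
mean_count = sum(counts) / len(counts)
print(zero_fraction, mean_count)
```

With these settings the negative binomial component alone produces zeros with probability (1 / (1 + 0.5 * 5))^2, about 0.08, so most of the zeros in the sample come from the zero-inflation component. This is the kind of excess sparsity the ZINB model is meant to capture.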

Summary

Introduction

Single-cell RNA-seq (scRNA-seq) data simulation is critical for evaluating computational methods for analysing scRNA-seq data, especially when ground truth is experimentally unattainable. To use scRNA-seq data effectively to address biological questions[2], computational tools for analysing such data are essential, and their development has grown exponentially with the increasing availability of scRNA-seq datasets. Evaluating the performance of these tools against credible ground truth has become a key task for assessing the quality and robustness of this growing array of computational resources. Because realistic simulated datasets are intended to reflect experimental datasets in all data moments, including both cell-wise and gene-wise properties as well as their higher-order interactions, it is important to determine how well simulation methods capture these properties. To this end, we systematically compare the performance of simulation methods across multiple sets of criteria, including the accuracy of estimates for data properties, the ability to retain biological signals, computational scalability and applicability. We summarise the results into recommendations for users and highlight potential areas requiring future research.
