Abstract

Reproducibility in the statistical analysis of data from high-throughput phenotyping screens requires a robust and reliable analysis foundation that can model the different statistical scenarios that arise. Recurring challenges are the scalability and extensibility of the analysis software. In this manuscript, we describe OpenStats, a freely available software package that addresses these challenges. We show the performance of the software in a high-throughput phenomic pipeline in the International Mouse Phenotyping Consortium (IMPC) and compare the agreement of its results with the most similar implementation in the literature. OpenStats offers significant improvements in speed and scalability over existing software packages, including a 13-fold improvement in computational time over the current production analysis pipeline in the IMPC. Reduced complexity also promotes FAIR data analysis by providing transparency and helping other groups reproduce and reuse the statistical methods and results. OpenStats is freely available under a Creative Commons license at www.bioconductor.org/packages/OpenStats.

Highlights

  • Statistics is the main inferential tool used in science and medicine to extract information from data

  • We address the issues of scalability, extensibility, versatility, and efficiency in the current International Mouse Phenotyping Consortium (IMPC) statistical pipeline, which is implemented using the R package PhenStat, by introducing a new package, OpenStats, written in the same development environment, R

  • The R package allows a fully customised analysis plan for the implemented methods (linear mixed model, Fisher’s exact test, and Reference Range Plus), as well as a comprehensive workflow focused on simplicity, efficiency, scalability, and completeness that offers more than raw statistical results and more than its counterparts in the literature

Introduction

Statistics is the main inferential tool used in science and medicine to extract information from data. It provides a set of proven steps for drawing conclusions and making decisions in spite of the uncertainty inherent in any data, which is unavoidable owing to biological variation as well as the constraints of cost, time, and measurement precision. Studies have shown that irreproducibility of results is often due to poor documentation of the statistical method [7,8,9]. This is especially critical in high-throughput phenomic screening, where tens of thousands of data points are generated and analysed
