beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types

Aaron T L Lun, Hervé Pagès, Mike L Smith

doi:10.1371/journal.pcbi.1006135

Abstract

Biological experiments involving genomics or other high-throughput assays typically yield a data matrix that can be explored and analyzed using the R programming language with packages from the Bioconductor project. Improvements in the throughput of these assays have resulted in an explosion of data even from routine experiments, which poses a challenge to the existing computational infrastructure for statistical data analysis. For example, single-cell RNA sequencing (scRNA-seq) experiments frequently generate large matrices containing expression values for each gene in each cell, requiring sparse or file-backed representations for memory-efficient manipulation in R. These alternative representations are not easily compatible with high-performance C++ code used for computationally intensive tasks in existing R/Bioconductor packages. Here, we describe a C++ interface named beachmat, which enables agnostic data access from various matrix representations. This allows package developers to write efficient C++ code that is interoperable with dense, sparse and file-backed matrices, amongst others. We evaluated the performance of beachmat for accessing data from each matrix representation using both simulated and real scRNA-seq data, and defined a clear memory/speed trade-off to motivate the choice of an appropriate representation. We also demonstrate how beachmat can be incorporated into the code of other packages to drive analyses of a very large scRNA-seq data set.

Highlights

The combination of the statistical programming language R [1] and the open-source Bioconductor project [2] represents a key platform for exploring and analyzing high-throughput biological data.
We describe a C++ application programming interface (API) named beachmat, which enables C++ code to access R matrix data in a manner that is agnostic to the exact matrix representation.
Recent advances in scRNA-seq technologies have led to an explosion in the quantity of data that can be generated in routine experiments.

Summary

Introduction

The combination of the statistical programming language R [1] and the open-source Bioconductor project [2] represents a key platform for exploring and analyzing high-throughput biological data. The use of R alone would increase the computational time required to perform analyses, which is inconvenient for interactive analyses and unacceptable for large simulation studies. The beachmat API uses C++ classes to provide a common interface for data access from R matrix representations.
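The idea of a common class-based interface can be illustrated with a minimal, self-contained sketch. It does not reproduce the beachmat API: the class names (NumericMatrix, DenseMatrix, SparseMatrix), the get_col method and the column_sums function are hypothetical and chosen only for illustration. The sketch assumes column-major dense storage and a compressed sparse column layout, and shows how an algorithm written against an abstract base class runs unchanged on either representation.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical abstract interface (NOT the beachmat API): algorithms
// see only this class, never the underlying storage.
class NumericMatrix {
public:
    virtual ~NumericMatrix() = default;
    virtual std::size_t nrow() const = 0;
    virtual std::size_t ncol() const = 0;
    // Copy column c into 'out', resizing it to nrow() entries.
    virtual void get_col(std::size_t c, std::vector<double>& out) const = 0;
};

// Dense storage in column-major order, as in an ordinary R matrix.
class DenseMatrix : public NumericMatrix {
public:
    DenseMatrix(std::size_t nr, std::size_t nc, std::vector<double> values)
        : nr_(nr), nc_(nc), x_(std::move(values)) {
        assert(x_.size() == nr_ * nc_);
    }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return nc_; }
    void get_col(std::size_t c, std::vector<double>& out) const override {
        assert(c < nc_);
        out.assign(x_.begin() + c * nr_, x_.begin() + (c + 1) * nr_);
    }
private:
    std::size_t nr_, nc_;
    std::vector<double> x_;
};

// Compressed sparse column storage: p holds ncol + 1 column offsets,
// i holds zero-based row indices and x the non-zero values.
class SparseMatrix : public NumericMatrix {
public:
    SparseMatrix(std::size_t nr, std::size_t nc, std::vector<std::size_t> p,
                 std::vector<std::size_t> i, std::vector<double> x)
        : nr_(nr), nc_(nc), p_(std::move(p)), i_(std::move(i)), x_(std::move(x)) {
        assert(p_.size() == nc_ + 1 && i_.size() == x_.size());
    }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return nc_; }
    void get_col(std::size_t c, std::vector<double>& out) const override {
        assert(c < nc_);
        out.assign(nr_, 0.0);  // entries not stored explicitly are zero
        for (std::size_t k = p_[c]; k < p_[c + 1]; ++k) {
            out[i_[k]] = x_[k];
        }
    }
private:
    std::size_t nr_, nc_;
    std::vector<std::size_t> p_, i_;
    std::vector<double> x_;
};

// Representation-agnostic algorithm: the same code serves any subclass.
std::vector<double> column_sums(const NumericMatrix& mat) {
    std::vector<double> sums(mat.ncol(), 0.0);
    std::vector<double> buffer;
    for (std::size_t c = 0; c < mat.ncol(); ++c) {
        mat.get_col(c, buffer);
        sums[c] = std::accumulate(buffer.begin(), buffer.end(), 0.0);
    }
    return sums;
}
```

Under these assumptions, a caller could construct either a DenseMatrix or a SparseMatrix holding the same values and pass it to column_sums, obtaining identical results without any representation-specific branching in the algorithm itself. The virtual call per column is the price paid for this flexibility in the sketch.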

Results

Conclusion