ASHLEYS: automated quality control for single-cell Strand-seq data.

Christina Gros, Peter Ebert, Tobias Marschall, Jan O Korbel, Ashley D Sanders

doi:10.1093/bioinformatics/btab221

Abstract

Summary: Single-cell DNA template strand sequencing (Strand-seq) enables chromosome length haplotype phasing, construction of phased assemblies, mapping sister-chromatid exchange events and structural variant discovery. The initial quality control of potentially thousands of single-cell libraries is still done manually by domain experts. ASHLEYS automates this tedious task, delivers near-expert performance and labels even large datasets in seconds.

Availability and implementation: github.com/friendsofstrandseq/ashleys-qc, MIT license.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

ASHLEYS is implemented as a Linux-only command-line tool available at repository 1.
As introduced in the main text, ASHLEYS uses two feature categories as predictors of Strand-seq library quality: generic sequencing library features, which are not Strand-seq specific and are independent of the chosen window size(s), and a set of features derived from the binned Watson/Crick read distribution.
As a consequence, normalizing the Watson/Crick ratio features by the total number of genomic windows would distort the distribution shown in main Fig. 1B. The expected shape of this distribution for high-quality libraries is motivated by the strand segregation pattern during diploid cell division and is desirable to preserve in that form.
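To illustrate the second feature category, the following minimal Python sketch derives a binned Watson/Crick read distribution from a list of aligned reads. It is not the ASHLEYS implementation: the function names `watson_fractions` and `bin_fractions`, the input format of `(position, strand)` tuples with strand labels `"W"` and `"C"`, and the choice of ten equal-width bins are assumptions made here for illustration only. The sketch returns raw window counts per bin and does not divide by the total number of genomic windows.

```python
from collections import Counter


def watson_fractions(reads, window_size):
    """Return {window_index: fraction of Watson reads} for one chromosome.

    `reads` is an iterable of (position, strand) tuples, where strand is
    "W" (Watson) or "C" (Crick). This input format is a hypothetical
    simplification; real data would come from an alignment file.
    """
    watson = Counter()
    total = Counter()
    for position, strand in reads:
        window = position // window_size  # fixed-size genomic window
        total[window] += 1
        if strand == "W":
            watson[window] += 1
    # Only windows that contain at least one read are reported.
    return {window: watson[window] / total[window] for window in total}


def bin_fractions(fractions, n_bins=10):
    """Count windows per Watson-fraction bin (raw counts, not normalized)."""
    counts = [0] * n_bins
    for fraction in fractions:
        # A fraction of exactly 1.0 is placed in the last bin.
        index = min(int(fraction * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    # Toy example: one window each of Watson-only, mixed and Crick-only reads.
    toy_reads = [(10, "W"), (20, "W"), (110, "W"), (120, "C"), (210, "C"), (220, "C")]
    per_window = watson_fractions(toy_reads, window_size=100)
    print(bin_fractions(per_window.values()))
```

In this toy input the three windows fall into the lowest, middle and highest bins. This loosely mirrors the Crick-only, mixed and Watson-only states that the strand segregation pattern is expected to produce in a high-quality diploid library.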

Summary

Software

ASHLEYS is implemented as a Linux-only command-line tool available at repository 1 (see below). The development environment was set up with Python v3.7 (www.python.org), Pysam v0.15.2 (github.com/pysam-developers/pysam) and scikit-learn v0.23.2 (Pedregosa et al., 2011). For an exact definition of the complete software environment, please refer to the environment file in the ASHLEYS repository under environment/ashleys env.yml. A preprocessing pipeline that exemplifies short-read alignment, marking of duplicate reads and feature computation per Strand-seq library is available at repository 2 (see below). This pipeline is implemented in the widely used workflow engine Snakemake (Köster and Rahmann, 2012), and setup and usage instructions are provided as part of the repository.

Feature modeling and model training

Findings

Training and test data