Abstract
Advances in sequencing and assembly technology have led to the creation of genome assemblies for a wide variety of non-model organisms. The rapid production and proliferation of updated, novel assembly versions can create vexing problems for researchers when multiple-genome assembly versions are available at once, requiring researchers to work with more than one reference genome. Multiple-genome assemblies are especially problematic for researchers studying the genetic makeup of individual cells, as single-cell RNA sequencing (scRNAseq) requires sequenced reads to be mapped and aligned to a single reference genome. Using Astyanax mexicanus, this study highlights how the interpretation of a single-cell dataset from the same sample changes when aligned to its two different available genome assemblies. We found that the numbers of cells and expressed genes detected differed drastically depending on the assembly used for alignment. When the genome assemblies were used in isolation with their respective annotations, cell-type identification was confounded, as some classic cell-type markers were assembly-specific, whilst other genes showed differential patterns of expression between the two assemblies. To overcome the problems posed by multiple-genome assemblies, we propose that researchers align to each available assembly and then integrate the resultant datasets to produce a final dataset in which all genome alignments can be used simultaneously. We found that this approach increased the accuracy of cell-type identification and maximised the amount of data that could be extracted from our single-cell sample by capturing all possible cells and transcripts. As scRNAseq becomes more widely available, it is imperative that the single-cell community is aware of how genome assembly alignment can alter single-cell data and their interpretation, especially when reviewing studies on non-model organisms.
Highlights
The use of single-cell RNA sequencing technology has greatly increased since it was first developed in 2009 [1]. scRNAseq provides transcriptome information about individual cells, enabling researchers to answer a wide variety of biological questions about topics such as cell–cell heterogeneity, tissue composition and cell-specific gene expression responses to disease and/or injury [2]
We use single-cell RNA sequencing data and two published reference assemblies from A. mexicanus to show, for the first time, that the same set of scRNAseq reads can produce different results when aligned to different genome assemblies, generating differences in matrix dimensions, gene-expression patterns and cell-type identification
To overcome the problems posed by multiple-genome assemblies that are discordantly annotated, we propose the alignment of scRNAseq samples to all available assemblies, followed by integration, to create a finalised dataset for use in downstream analysis
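The idea in the highlights above (the same reads yield different matrices per assembly, and the alignments can be used together) can be illustrated with a toy sketch. This is not the authors' pipeline: the data layout, the function names (`compare_alignments`, `combine_alignments`), the assembly labels and the barcodes and gene names are all hypothetical, and real integration would normally be done with dedicated single-cell analysis tools rather than plain dictionaries.

```python
from typing import Dict

# Hypothetical toy representation: cell barcode -> {gene name -> count}.
CountMatrix = Dict[str, Dict[str, int]]


def matrix_summary(counts: CountMatrix) -> dict:
    """Return the cell barcodes and the genes with at least one count."""
    genes = {g for cell in counts.values() for g, n in cell.items() if n > 0}
    return {"cells": set(counts), "genes": genes}


def compare_alignments(a: CountMatrix, b: CountMatrix) -> dict:
    """Report which cells and genes are shared or specific to each alignment."""
    sa, sb = matrix_summary(a), matrix_summary(b)
    return {
        "cells_shared": sa["cells"] & sb["cells"],
        "cells_only_a": sa["cells"] - sb["cells"],
        "cells_only_b": sb["cells"] - sa["cells"],
        "genes_shared": sa["genes"] & sb["genes"],
        "genes_only_a": sa["genes"] - sb["genes"],
        "genes_only_b": sb["genes"] - sa["genes"],
    }


def combine_alignments(a: CountMatrix, b: CountMatrix,
                       label_a: str = "asmA", label_b: str = "asmB") -> CountMatrix:
    """Keep every cell seen in either alignment and prefix each gene with its
    assembly label, so features from both annotations sit side by side."""
    combined: CountMatrix = {}
    for label, counts in ((label_a, a), (label_b, b)):
        for barcode, genes in counts.items():
            cell = combined.setdefault(barcode, {})
            for gene, n in genes.items():
                cell[f"{label}:{gene}"] = n
    return combined


# Made-up counts for the same sample aligned to two assemblies.
aligned_a = {"AAAC": {"gene1": 3, "gene2": 1}, "AAAG": {"gene1": 2}}
aligned_b = {"AAAC": {"gene1": 4, "gene3": 2}, "AAAT": {"gene3": 5}}

report = compare_alignments(aligned_a, aligned_b)
combined = combine_alignments(aligned_a, aligned_b)
```

Here `report` lists the assembly-specific cells and genes, while `combined` keeps the union of cells with features from both annotations. Prefixing gene names by assembly is only one simple way to keep the two annotations distinguishable; it is a design choice of this sketch, not something stated in the source.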
Summary
The use of single-cell RNA sequencing (scRNAseq) technology has greatly increased since it was first developed in 2009 [1]. scRNAseq provides transcriptome information about individual cells, enabling researchers to answer a wide variety of biological questions about topics such as cell–cell heterogeneity, tissue composition and cell-specific gene expression responses to disease and/or injury [2]. Developments in high-throughput DNA sequencing, de novo genome assembly technologies and automated genome annotation have significantly decreased the time and cost required for genome assembly construction [10,11]. This has made it feasible for individual labs to sequence and construct a genome assembly, enabling scRNAseq to be applied to any non-model organism of choice [4]. Although the accelerated production and publication of genome assemblies for non-model organisms is very beneficial for a wide range of biomedical research, it can result in multiple-genome assemblies for a given non-model organism being available concurrently. This is especially problematic for single-cell researchers, as scRNAseq analysis requires sequenced reads to be mapped and aligned to a single reference genome. How the interpretation of a single-cell dataset might change depending on the reference to which it is aligned has not been explored.