Abstract
Gene assembly, which recovers gene segments from short reads, is an important step in functional analysis of next-generation sequencing data. Lacking quality reference genomes, de novo assembly is commonly used for RNA-Seq data of non-model organisms and metagenomic data. However, heterogeneous sequence coverage caused by heterogeneous expression or species abundance, similarity between isoforms or homologous genes, and large data size all pose challenges to de novo assembly. As a result, existing assembly tools tend to output fragmented or chimeric contigs, or have a high memory footprint. In this work, we introduce a targeted gene assembly program, SAT-Assembler, which aims to recover gene families of particular interest to biologists. It addresses the above challenges by conducting family-specific homology search, homology-guided overlap graph construction, and careful graph traversal. It can be applied to both RNA-Seq and metagenomic data. Our experimental results on an Arabidopsis RNA-Seq data set and two metagenomic data sets show that, compared with other assembly tools, SAT-Assembler has lower memory usage, comparable or better gene coverage, and a lower chimera rate when assembling a set of genes from one or multiple pathways. Moreover, the family-specific design and rapid homology search make SAT-Assembler naturally compatible with parallel computing platforms. The source code of SAT-Assembler is available at https://sourceforge.net/projects/sat-assembler/. The data sets and experimental settings can be found in the supplementary material.
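To make the terms "overlap graph" and "graph traversal" concrete, the following is a minimal sketch of the generic technique, not SAT-Assembler's implementation. It assumes error-free reads on a single strand and omits the homology search and homology guidance entirely. The function names, the `min_len` parameter, and the toy reads are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len):
    """Directed graph over read indices: graph[i][j] = overlap length of i -> j."""
    graph = defaultdict(dict)
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = suffix_prefix_overlap(a, b, min_len)
                if k:
                    graph[i][j] = k
    return graph


def assemble_contigs(reads, graph):
    """Merge reads along unambiguous paths, starting at reads with no predecessor.

    Toy traversal: extension stops at any branch, and reads that lie only
    on branching or cyclic paths are left out of the output.
    """
    indegree = defaultdict(int)
    for src in graph:
        for dst in graph[src]:
            indegree[dst] += 1

    contigs, visited = [], set()
    for start in range(len(reads)):
        if indegree[start] != 0 or start in visited:
            continue
        contig, node = reads[start], start
        visited.add(node)
        # Extend only while the current read has exactly one successor
        # and that successor has exactly one predecessor.
        while len(graph.get(node, {})) == 1:
            (nxt, k), = graph[node].items()
            if indegree[nxt] != 1 or nxt in visited:
                break
            contig += reads[nxt][k:]  # append the non-overlapping tail
            visited.add(nxt)
            node = nxt
        contigs.append(contig)
    return contigs


# Hypothetical toy reads; each consecutive pair overlaps by 4 bases.
toy_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
toy_graph = build_overlap_graph(toy_reads, min_len=4)
print(assemble_contigs(toy_reads, toy_graph))  # prints ['ATGGCGTACCTGA']
```

The all-pairs comparison is quadratic in the number of reads, which is one reason a targeted approach that first restricts the read set to a gene family can reduce the size of the graph that must be built.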
Highlights
Advances in next-generation sequencing (NGS) technologies enable sequencing of transcriptomes of a large number of non-model organisms (RNA-Seq) and species from various environmental samples (metagenomic data).
To show the utility of SAT-Assembler, we applied it to an Arabidopsis RNA-Seq data set and two metagenomic data sets.
By mapping reads to an annotated reference genome or characterized genes from existing databases, we constructed a set of reference/target genes, which are transcribed or encoded in an NGS data set.
Summary
Advances in next-generation sequencing (NGS) technologies enable sequencing of transcriptomes of a large number of non-model organisms (RNA-Seq) and species from various environmental samples (metagenomic data). Functional annotation is an important step in analyzing these NGS data sets. For RNA-Seq of non-model organisms and metagenomic data, which lack quality reference genomes, a commonly used functional annotation pipeline conducts de novo assembly first and then applies functional annotation analysis to the assembled contigs. This pipeline has been widely adopted in functional analysis of RNA-Seq data [1,2,3,4] and gene-centric metagenomic analysis [5,6,7,8,9]. The performance of downstream functional analysis largely depends on the quality of the de novo assembly, which is still a challenging problem for RNA-Seq and metagenomic data.