EasyCluster2: an improved tool for clustering and assembling long transcriptome reads.

Vitoantonio Bevilacqua, Ely Ignazio Giannino, Domenico Simone, Graziano Pesole, Ernesto Picardi, Nicola Pietroleonardo, Fabio Stroppa

doi:10.1186/1471-2105-15-s15-s7

Abstract

Background: Expressed sequences (e.g. ESTs) are a strong source of evidence to improve gene structures and predict reliable alternative splicing events. When a genome assembly is available, ESTs are suitable to generate gene-oriented clusters through the well-established EasyCluster software. Nowadays, EST-like sequences can be massively produced using Next Generation Sequencing (NGS) technologies. In order to handle genome-scale transcriptome data, we present here EasyCluster2, a reimplementation of EasyCluster able to speed up the creation of gene-oriented clusters and facilitate downstream analyses such as the assembly of full-length transcripts and the detection of splicing isoforms.

Results: EasyCluster2 has been developed to facilitate the genome-based clustering of EST-like sequences generated through the NGS 454 technology. Reads mapped onto the reference genome can be uploaded using the standard GFF3 file format. Alignment parsing is initially performed to produce a first collection of pseudo-clusters by grouping reads according to the overlap of their genomic coordinates on the same strand. EasyCluster2 then refines read grouping by including in each cluster only reads sharing at least one splice site and optionally performs a Smith-Waterman alignment in the region surrounding splice sites in order to correct for potential alignment errors. In addition, EasyCluster2 can include unspliced reads, which generally account for >50% of 454 datasets, and collapses overlapping clusters. Finally, EasyCluster2 can assemble full-length transcripts using a Directed-Acyclic-Graph-based strategy, simplifying the identification of alternative splicing isoforms, thanks also to the implementation of the widespread AStalavista methodology. Accuracy and performance have been tested on real as well as simulated datasets.

Conclusions: EasyCluster2 represents a unique tool to cluster and assemble transcriptome reads produced with 454 technology, as well as ESTs and full-length transcripts. The clustering procedure is enhanced by the use of genome annotations and unspliced reads. Overall, EasyCluster2 is able to perform an effective detection of splicing isoforms, since it can refine exon-exon junctions and explore alternative splicing without known reference transcripts. Results in GFF3 format can be browsed in the UCSC Genome Browser. Therefore, EasyCluster2 is a powerful tool to generate reliable clusters for gene expression studies, and it also makes the analysis accessible to researchers not skilled in bioinformatics.
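The two grouping steps described in the Results (pseudo-clusters from coordinate overlap on the same strand, then refinement by shared splice sites) can be illustrated with a minimal Python sketch. This is not the EasyCluster2 implementation: the `Read` type, the function names, and the choice to treat "sharing at least one splice site" transitively (via union-find) are assumptions made here for illustration only. Unspliced reads are simply returned separately, and the optional Smith-Waterman correction and cluster collapsing are not modelled.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    # Hypothetical record for one mapped read (not an EasyCluster2 type).
    name: str
    chrom: str
    strand: str
    exons: tuple  # sorted (start, end) aligned blocks, 1-based inclusive


def span(read):
    """Genomic interval covered by the whole alignment."""
    return read.exons[0][0], read.exons[-1][1]


def splice_sites(read):
    """Intron boundaries of a read, tagged by which side of the intron they lie on."""
    sites = set()
    for (_, end), (start, _) in zip(read.exons, read.exons[1:]):
        sites.add(("exon_end", end))
        sites.add(("exon_start", start))
    return sites


def pseudo_clusters(reads):
    """Group reads whose spans overlap on the same chromosome and strand."""
    by_locus = defaultdict(list)
    for r in reads:
        by_locus[(r.chrom, r.strand)].append(r)
    clusters = []
    for key in sorted(by_locus):
        current, current_end = [], None
        for r in sorted(by_locus[key], key=span):
            start, end = span(r)
            if current and start > current_end:
                clusters.append(current)  # gap found: close the open cluster
                current = []
            current_end = end if not current else max(current_end, end)
            current.append(r)
        clusters.append(current)
    return clusters


def refine(cluster):
    """Split a pseudo-cluster into groups of spliced reads linked by shared
    splice sites (assumption: linkage is transitive). Returns the groups of
    read names and, separately, the names of the unspliced reads."""
    spliced = [r for r in cluster if len(r.exons) > 1]
    unspliced = [r.name for r in cluster if len(r.exons) == 1]
    parent = list(range(len(spliced)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}
    for i, r in enumerate(spliced):
        for site in splice_sites(r):
            if site in first_seen:
                parent[find(i)] = find(first_seen[site])
            else:
                first_seen[site] = i
    groups = defaultdict(list)
    for i, r in enumerate(spliced):
        groups[find(i)].append(r.name)
    return sorted(groups.values()), unspliced
```

In this sketch a read that overlaps a cluster but shares no intron boundary with any of its members ends up in its own group, which mirrors the stated idea that coordinate overlap alone is not sufficient evidence for assigning reads to the same gene-oriented cluster.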

Highlights

Expressed sequences, e.g. expressed sequence tags (ESTs), are a strong source of evidence to improve gene structures and predict reliable alternative splicing events
GFF3 parsing and first clustering: in contrast with the previous version, EasyCluster2 accepts alignment files in the standard GFF3 format as input and parses them in memory using Java classes from a custom library
EasyCluster2 is a reimplementation of the EasyCluster software, devoted to generating gene-oriented clusters from massive sets of transcriptome reads
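As an illustration of the in-memory GFF3 parsing mentioned in the highlights, the following Python sketch reads the nine tab-separated GFF3 columns and collects the aligned blocks of each read. It is not the custom Java library used by EasyCluster2. It assumes that all blocks of one read share the same `ID` attribute (the GFF3 convention for discontinuous features); the exact feature types and attributes that EasyCluster2 expects are not stated here, and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import unquote


def parse_attributes(field):
    """Parse the ninth GFF3 column (semicolon-separated key=value pairs)."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[unquote(key)] = unquote(value)  # GFF3 uses URL-style escapes
    return attrs


def read_gff3_alignments(lines):
    """Return {(read_id, seqid, strand): sorted list of (start, end) blocks}.

    Assumption: every aligned block of a read is one GFF3 line and all
    lines of the same read carry the same ID attribute.
    """
    blocks = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequences follow; no more feature lines
        if not line or line.startswith("#"):
            continue  # blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        seqid, _source, _type, start, end, _score, strand, _phase, attributes = cols
        read_id = parse_attributes(attributes).get("ID")
        if read_id is None:
            continue
        blocks[(read_id, seqid, strand)].append((int(start), int(end)))
    return {key: sorted(value) for key, value in blocks.items()}
```

The function accepts any iterable of lines, so it can be called on an open file handle (`with open(path) as handle: read_gff3_alignments(handle)`) or on a list of strings.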

Summary

Introduction

Expressed sequence tags (ESTs) and full-length cDNAs (FL-cDNAs) are an invaluable source of evidence to infer reliable gene structures and discover potential alternative splicing events [1]. Their biological potential can be fully [...] to generate, through the GS FLX+ Titanium chemistry, sequence reads up to 1 kb long (http://www.454.com/) [5]. Handling huge amounts of EST-like data is extremely useful to detect alternative isoforms, improve gene annotations or create gene-oriented clusters for expression studies. While wcd implements a new algorithm based on suffix arrays to handle the huge amounts of reads generated by high-throughput sequencers [8], RCDA has been conceived only for ESTs produced by classical Sanger sequencing and has never been tested on long sequences produced by next-generation technologies such as Roche 454 [7].

Methods

Results

Discussion

Conclusion