PEAR: a fast and accurate Illumina Paired-End reAd mergeR

Jiajie Zhang, Alexandros Stamatakis, Tomáš Flouri, Kassian Kobert

doi:10.1093/bioinformatics/btt593

Abstract

Motivation: The Illumina paired-end sequencing technology can generate reads from both ends of target DNA fragments, which can subsequently be merged to increase the overall read length. There already exist tools for merging these paired-end reads when the target fragments are equally long. However, when fragment lengths vary and, in particular, when either the fragment size is shorter than a single-end read, or longer than twice the size of a single-end read, most state-of-the-art mergers fail to generate reliable results. Therefore, a robust tool is needed to merge paired-end reads that exhibit varying overlap lengths because of varying target fragment lengths.

Results: We present the PEAR software for merging raw Illumina paired-end reads from target fragments of varying length. The program evaluates all possible paired-end read overlaps and does not require the target fragment size as input. It also implements a statistical test for minimizing false-positive results. Tests on simulated and empirical data show that PEAR consistently generates highly accurate merged paired-end reads. A highly optimized implementation allows for merging millions of paired-end reads within a few minutes on a standard desktop computer. On multi-core architectures, the parallel version of PEAR shows linear speedups compared with the sequential version of PEAR.

Availability and implementation: PEAR is implemented in C and uses POSIX threads. It is freely available at http://www.exelixis-lab.org/web/software/pear.

Contact: Tomas.Flouri@h-its.org

Highlights

The Illumina sequencing platform can produce millions of short reads in a single run.
To identify false-positive merged reads, we propose a statistical test that is based on the observed expected alignment scores (OESs).
We set the parameters of ART to generate target DNA fragments by randomly sampling the reference sequences until a 10-fold coverage of the reference dataset was reached.
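
The fragment-sampling step in the last highlight can be pictured with a short Python sketch. This is not ART and does not use ART's command-line interface; the function `sample_fragments` and its parameters are hypothetical names introduced for illustration. Coverage is counted here as sampled fragment bases divided by total reference bases, which is an assumption, and ART's own accounting may differ.

```python
import random


def sample_fragments(references, mean_len, sd_len, target_coverage, seed=0):
    """Draw random fragments until sampled bases reach target_coverage x reference bases."""
    rng = random.Random(seed)
    weights = [len(ref) for ref in references]
    needed = target_coverage * sum(weights)
    fragments, sampled = [], 0
    while sampled < needed:
        # Pick a reference with probability proportional to its length (assumption).
        ref = rng.choices(references, weights=weights)[0]
        # Normally distributed fragment length, clamped to [1, len(ref)] (assumption).
        frag_len = max(1, min(len(ref), round(rng.gauss(mean_len, sd_len))))
        start = rng.randint(0, len(ref) - frag_len)
        fragments.append(ref[start:start + frag_len])
        sampled += frag_len
    return fragments
```

Calling `sample_fragments(refs, mean_len=200, sd_len=20, target_coverage=10)` on a list of reference strings returns fragments whose combined length is at least ten times the combined reference length.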

Summary

INTRODUCTION

The Illumina sequencing platform can produce millions of short reads in a single run. In contrast to FLASH, PANDAseq works well with short overlap regions and does not require prior knowledge of the target DNA fragment size. However, it assumes that all paired-end reads can be merged. Most current paired-end mergers assume that the DNA fragments are longer than the individual single-end reads. When this does not hold, for example when sequencing the V6 region of 16S rRNA genes of bacterial samples [fragment sizes range between 110 and 130 bp (Gloor et al., 2010)] with read lengths of 150 bp (see case C in Fig. 1), current mergers will generate erroneous results. PEAR is accurate on datasets with (i) short overlaps and (ii) DNA target fragment sizes that are smaller than single-end read lengths. The parallel version of PEAR scales linearly with the number of cores.
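
The case of fragments shorter than a single read can be made concrete with a small Python sketch. It is not PEAR's implementation: it scores every candidate overlap with a flat +1/-1 match/mismatch scheme, ignores base qualities, and omits the statistical test. The names `merge_pair`, `simulate_pair`, `min_overlap`, and the scoring values are assumptions introduced for illustration. What the sketch does show is that allowing negative offsets lets the reverse read start before the forward read, which is what a fragment shorter than the read length requires.

```python
import random

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd, rev, min_overlap=10, match=1, mismatch=-1):
    """Merge one read pair; return (merged, score) or None if no overlap scores > 0."""
    rc = reverse_complement(rev)  # bring the reverse read onto the forward strand
    best = None  # (score, offset) of the best overlap seen so far
    # offset = position of rc[0] relative to fwd[0]; negative values mean the
    # fragment is shorter than the reverse read.
    for offset in range(min_overlap - len(rc), len(fwd) - min_overlap + 1):
        start = max(0, offset)
        end = min(len(fwd), offset + len(rc))
        if end - start < min_overlap:
            continue
        score = sum(match if fwd[i] == rc[i - offset] else mismatch
                    for i in range(start, end))
        if best is None or score > best[0]:
            best = (score, offset)
    if best is None or best[0] <= 0:
        return None
    score, offset = best
    fragment_len = offset + len(rc)  # the fragment ends where rc ends
    # Inside the overlap the forward base is kept (a simplification).
    merged = "".join(fwd[i] if i < len(fwd) else rc[i - offset]
                     for i in range(fragment_len))
    return merged, score


def simulate_pair(fragment, read_len, rng):
    """Hypothetical error-free read pair; random bases stand in for read-through."""
    pad = "".join(rng.choices("ACGT", k=read_len))
    fwd = (fragment + pad)[:read_len]
    pad = "".join(rng.choices("ACGT", k=read_len))
    rev = (reverse_complement(fragment) + pad)[:read_len]
    return fwd, rev


if __name__ == "__main__":
    rng = random.Random(0)
    for frag_len in (250, 120):  # longer than, then shorter than, a 150-base read
        fragment = "".join(rng.choices("ACGT", k=frag_len))
        fwd, rev = simulate_pair(fragment, 150, rng)
        result = merge_pair(fwd, rev)
        print(frag_len, result is not None and result[0] == fragment)
```

A merger that only tries non-negative offsets would never consider the correct alignment for the 120-base fragment, because the true offset there is -30. Picking the highest raw score is also what makes a significance test necessary in practice: on real, error-containing reads some overlap always scores best, whether or not the reads truly overlap.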

IMPLEMENTATION

Overlap algorithm

Statistical test

Output

Parallelization and memory management

RESULTS AND DISCUSSION

Simulated data

Single known sequence data

Run time and memory requirement

Reasons for high FPRs in PANDASeq

CONCLUSIONS

Journal: Computer applications in the biosciences : CABIOS
Publication Date: Oct 18, 2013
Citations: 3441
License type: CC BY 3.0
