Abstract

Personal genomic data are an important part of personal health data. Because next-generation sequencing produces these data in very large volumes, special tools are needed to analyze them. In this article, we explore a tool for analyzing cloud-based, large-scale genome sequencing data. Identifying genomic variations from amplicon-based next-generation sequencing data is necessary for the clinical diagnosis and treatment of cancer patients. When processing such data, one essential step is removing primer sequences from the reads, which avoids detecting false-positive mutations introduced by nonspecific primer binding and primer extension reactions. At present, primer-removal tools usually discard primer sequences from the FASTQ file rather than the BAM file, but this approach can cause problems in downstream analysis. Only one tool (BAMClipper) removes primer sequences from BAM files, but it only modifies the CIGAR value of the BAM file, so false-positive mutations falling in the primer region can still be detected from its processed BAM file. We therefore developed a primer-cutting tool (rmvPFBAM) that removes primer sequences from the BAM file; the mutations detected from the BAM file processed by rmvPFBAM are highly credible. In addition, rmvPFBAM runs faster than other tools, such as cutPrimers and BAMClipper.

Highlights

  • Genomic variations are associated with the pathogenesis and treatment of many diseases, especially cancer

  • To demonstrate rmvPFBAM, six patients from SRP019940 were randomly selected to form the dataset (SRR866441, SRR866442, SRR866443, SRR866444, SRR866445, and SRR948507). The raw reads were downloaded from the European Nucleotide Archive datasets. Then, the reads were aligned using the BWA software [18]. rmvPFBAM and BAMClipper were run on the aligned BAM files, and cutPrimers was run on the FASTQ files

  • The functionality of rmvPFBAM, cutPrimers, and BAMClipper was compared on the following parameters: (1) running time, (2) no. of paired reads after primer cutting, (3) no. of target region reads after primer cutting, (4) no. of nontarget region reads after primer cutting, (5) no. of mutations detected based on the primer-cut BAM, and (6) no. of mutations based on the primer-cut BAM
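Parameters (3) and (4) above amount to classifying each aligned read as falling inside or outside the target regions. The sketch below shows one way such a count could be made. It is an illustration only, not the procedure used in the study: the function names are hypothetical, and treating any overlap with a target interval as "target region" is an assumption.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def count_target_reads(
    reads: Iterable[Tuple[str, int, int]],
    targets: Dict[str, List[Interval]],
) -> Tuple[int, int]:
    """Return (target_region_reads, nontarget_region_reads).

    Assumption: a read counts as a target region read if its aligned
    span overlaps any target interval on the same chromosome.
    """
    on_target = off_target = 0
    for chrom, start, end in reads:
        if any(overlaps((start, end), t) for t in targets.get(chrom, [])):
            on_target += 1
        else:
            off_target += 1
    return on_target, off_target


# Hypothetical toy data: one target region and three aligned reads.
example_targets = {"chr1": [(100, 200)]}
example_reads = [("chr1", 90, 110), ("chr1", 200, 250), ("chr2", 100, 150)]
example_counts = count_target_reads(example_reads, example_targets)
```

A linear scan over the target list is enough for a small amplicon panel; a sorted or indexed interval structure would be the natural replacement for larger panels.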

Summary

Introduction

Genomic variations are associated with the pathogenesis and treatment of many diseases, especially cancer. Several technologies can detect genomic variations, such as polymerase chain reaction (PCR), Sanger sequencing, and next-generation sequencing [1]. Targeted sequencing is a commonly used application of next-generation sequencing that focuses on specific genomic regions [1]. Because it is cost-effective and produces high-depth sequencing data in which low-frequency genomic variations can be detected, targeted sequencing is the most widely used approach in clinical cancer diagnosis [2]. Amplicon-based sequencing uses multiplex PCR technology to generate thousands of amplicons for massively parallel sequencing and is one of the widely used targeted sequencing technologies because of its easier operation and higher amplification efficiency [4,5,6,7,8,9]. The primers must be removed before downstream analysis, such as mutation detection, is carried out.
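To make the primer-removal step concrete, the sketch below contrasts two ways of handling primer bases at the left end of an aligned read: soft-clipping them in the CIGAR while keeping the bases in the record, and removing the bases from the sequence altogether. It is a minimal illustration under stated assumptions, not the implementation of rmvPFBAM or BAMClipper. The function names, the list-of-tuples CIGAR representation, and the 0-based exclusive `primer_end` coordinate are all hypothetical, and only the left end of a forward-strand read is handled.

```python
from typing import List, Tuple

Cigar = List[Tuple[str, int]]  # e.g. [("M", 50)] stands for "50M"


def _consume_left(pos: int, cigar: Cigar, primer_end: int) -> Tuple[int, int, Cigar]:
    """Walk the CIGAR from the left until the reference position reaches
    primer_end (0-based, exclusive).

    Returns (new_pos, read_bases_covered, remaining_cigar).
    """
    ops = list(cigar)
    ref_pos, covered = pos, 0
    while ops and ref_pos < primer_end:
        op, n = ops.pop(0)
        if op in "M=X":  # consumes both read and reference
            take = min(n, primer_end - ref_pos)
            ref_pos += take
            covered += take
            if take < n:
                ops.insert(0, (op, n - take))
        elif op in "IS":  # consumes read bases only
            covered += n
        elif op in "DN":  # consumes reference bases only
            ref_pos += n
        # H and P consume neither and are simply dropped here
    # An alignment should not begin with a deletion or skip.
    while ops and ops[0][0] in "DN":
        ref_pos += ops.pop(0)[1]
    return ref_pos, covered, ops


def soft_clip_left(pos: int, cigar: Cigar, seq: str, primer_end: int):
    """Mark primer bases as soft-clipped; the bases stay in the sequence."""
    new_pos, covered, rest = _consume_left(pos, cigar, primer_end)
    if covered == 0:
        return pos, list(cigar), seq
    return new_pos, [("S", covered)] + rest, seq


def hard_trim_left(pos: int, cigar: Cigar, seq: str, primer_end: int):
    """Delete primer bases from the sequence and shorten the CIGAR."""
    new_pos, covered, rest = _consume_left(pos, cigar, primer_end)
    return new_pos, rest, seq[covered:]


# Hypothetical read: aligned at position 100 with CIGAR 10M; primer covers 100..103.
clipped = soft_clip_left(100, [("M", 10)], "ACGTACGTAC", 104)
trimmed = hard_trim_left(100, [("M", 10)], "ACGTACGTAC", 104)
```

In the soft-clipped record the primer bases remain in the sequence field, whereas in the trimmed record they are gone. A real tool would also need to trim base qualities, handle the right end and reverse-strand reads, and update mate information, all of which are omitted here.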

