Abstract

Summary form only given. In this talk we present our algorithms for two important problems in processing next-generation sequencing (NGS) data: compression and error correction. We live in an era of data explosion. For example, NCBI houses petabytes of genomic data, and biologists around the world are generating 15 petabases of sequence per year. Metagenomic data from multiple samples can also run to petabytes. Biologists want to store these datasets for several reasons, but standard compression algorithms perform poorly on them. Several approaches for compressing genomic data have been proposed in the literature, and they differ according to the type of data being compressed. Example types are genomic data (with and without a reference), reads data (FASTA files), and FASTQ files. We have developed algorithms for genomic data (with a reference) and FASTA files (without a reference) that perform better than some of the best-known algorithms for these two settings. In NGS technology, the chance of low read coverage in some regions of the sequences is very high. The reads are short and very large in number, and erroneous base calling can introduce errors into them. As a consequence, sequence assemblers often fail to assemble an entire DNA molecule and instead output a set of overlapping segments that together represent a consensus region of the DNA. The error rate of the reads can be reduced by trimming and by correcting erroneous bases. Doing so yields higher-quality data, and the computational cost of many biological applications drops considerably if the reads are corrected first. We have developed a novel error-correcting algorithm called EC and compared it with three other well-known algorithms on both real and simulated reads. Our extensive experiments indicate that EC is an effective and efficient error correction tool.
