Abstract
Various types of analyses performed over multi-omics data are driven today by next-generation sequencing (NGS) techniques that produce large volumes of DNA/RNA sequences. Although many tools allow for parallel processing of NGS data in a Big Data distributed environment, they do not facilitate improving the quality of NGS data at large scale in a simple, declarative manner. Meanwhile, large sequencing projects and routine DNA/RNA sequencing associated with molecular profiling of diseases for personalized treatment require both good-quality data and an appropriate infrastructure for efficient storage and processing of the data. To address these problems, we adapt the concept of a Data Lake for storing and processing big NGS data. We also propose a dedicated library for cleaning DNA/RNA sequences obtained with single-read and paired-end sequencing techniques. To accommodate the growth of NGS data, our solution is highly scalable in the Cloud and can rapidly and flexibly adjust to the amount of data to be processed. Moreover, to simplify the use of the data cleaning methods and the implementation of other phases of data analysis workflows, our library extends the declarative U-SQL query language with a set of capabilities for data extraction, processing, and storing. The results of our experiments show that the whole solution meets the requirements for ample storage and highly parallel, scalable processing that accompany NGS-based multi-omics data analyses.
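A minimal sketch of what such a declarative cleaning step could look like in U-SQL. The namespace and operator names below (NGSLib, FastqExtractor, QualityTrimmer, FastqOutputter) and the minQuality parameter are hypothetical placeholders for illustration, not the library's confirmed API; only the EXTRACT, PROCESS, and OUTPUT statements are standard U-SQL:

    // Read FASTQ records from the Data Lake store.
    // NGSLib.FastqExtractor is a hypothetical custom extractor.
    @reads =
        EXTRACT id string,
                sequence string,
                quality string
        FROM "/datalake/input/sample.fastq"
        USING new NGSLib.FastqExtractor();

    // Trim low-quality bases from read ends; the processor name
    // and quality threshold are illustrative assumptions.
    @cleaned =
        PROCESS @reads
        PRODUCE id string,
                sequence string,
                quality string
        USING new NGSLib.QualityTrimmer(minQuality: 20);

    // Store the cleaned reads back in the Data Lake.
    OUTPUT @cleaned
    TO "/datalake/output/sample.cleaned.fastq"
    USING new NGSLib.FastqOutputter();

In this style, the cleaning workflow is expressed as a declarative query rather than imperative code, so the underlying engine can parallelize it across the cluster.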
Highlights
Several sequencing platforms commercially available today allow thousands or even millions of DNA/mRNA sequence fragments to be obtained simultaneously
The presented Data Lake-based approach was tested to verify both the quality of results and the performance of next-generation sequencing (NGS) data cleaning
We checked whether the NGS data processed and cleaned with the developed library are identical to those obtained from analogous processing performed on local workstations with the Trimmomatic program
Summary
Several sequencing platforms commercially available today allow thousands or even millions of DNA/mRNA sequence fragments (sequence reads) to be obtained simultaneously. Raw data obtained once sequencing is complete comprise a set of many short genome sequence reads that usually undergo several phases of data analysis. The NGS data pre-processing scheme preceding a secondary data analysis should include a sequence quality control and data processing phase covering the removal of low-quality sequences and bases, demultiplexing, removal of adapters, primers, and contamination, error correction, and detection of enrichment biases. Each nucleotide in the DNA/mRNA read is accompanied by information about the probability of its misidentification. This probability directly determines the Phred quality score, which is given for the DNA sequence reads in FASTQ files. The quality score Q for a base call is a logarithmic measure depending on the probability P of incorrect nucleotide identification (Ewing and Green, 1998): Q = -10 log10(P).
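As a worked illustration of the Phred scale (standard quality-score arithmetic, not specific to this paper's library), the definition and its inverse can be written as:

    \[
      Q = -10 \log_{10} P
      \quad\Longleftrightarrow\quad
      P = 10^{-Q/10}
    \]
    % For example, Q = 30 corresponds to P = 10^{-3},
    % i.e., a 1-in-1000 chance that the base was called
    % incorrectly, or 99.9% base-call accuracy.

This is why trimming reads below a fixed Q threshold (e.g., Q = 20, a 1% error probability) is a common cleaning criterion.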