Abstract
BackgroundWhole-genome sequencing (WGS) plays an increasingly important role in clinical practice and public health. Due to the big data size, WGS data analysis is usually compute-intensive and IO-intensive. Currently it usually takes 30 to 40 h to finish a 50× WGS analysis task, which is far from the ideal speed required by the industry. Furthermore, the high-end infrastructure required by WGS computing is costly in terms of time and money. In this paper, we aim to improve the time efficiency of WGS analysis and minimize the cost by elastic cloud computing.ResultsWe developed a distributed system, GT-WGS, for large-scale WGS analyses utilizing the Amazon Web Services (AWS). Our system won the first prize on the Wind and Cloud challenge held by Genomics and Cloud Technology Alliance conference (GCTA) committee. The system makes full use of the dynamic pricing mechanism of AWS. We evaluate the performance of GT-WGS with a 55× WGS dataset (400GB fastq) provided by the GCTA 2017 competition. In the best case, it only took 18.4 min to finish the analysis and the AWS cost of the whole process is only 16.5 US dollars. The accuracy of GT-WGS is 99.9% consistent with that of the Genome Analysis Toolkit (GATK) best practice. We also evaluated the performance of GT-WGS performance on a real-world dataset provided by the XiangYa hospital, which consists of 5× whole-genome dataset with 500 samples, and on average GT-WGS managed to finish one 5× WGS analysis task in 2.4 min at a cost of $3.6.ConclusionsWGS is already playing an important role in guiding therapeutic intervention. However, its application is limited by the time cost and computing cost. GT-WGS excelled as an efficient and affordable WGS analyses tool to address this problem. The demo video and supplementary materials of GT-WGS can be accessed at https://github.com/Genetalks/wgs_analysis_demo.
Highlights
Whole-genome sequencing (WGS) plays an increasingly important role in clinical practice and public health
GT-WGS, the distributed whole-genome computing system we built, is based on the cloud computing platform provided by Amazon Web Service
It took GT-WGS about 18 min to accomplish the 55× WGS data analysis at a computing cost of $16.5, and 2.39 min for each 5× whole-genome sequencing with $3.62 on average, and the output accuracy is up to standards required by the Genomics and Cloud Technology Alliance conference (GCTA) challenge
Summary
Whole-genome sequencing (WGS) plays an increasingly important role in clinical practice and public health. Due to the big data size, WGS data analysis is usually compute-intensive and IO-intensive. It usually takes 30 to 40 h to finish a 50× WGS analysis task, which is far from the ideal speed required by the industry. WGS data plays an important role in guiding disease prevention, clinical diagnoses and therapeutic intervention [4, 5]. WGS by NGS is transforming the diagnostic testing, treatment selection, and many other clinical practices, due to its ability to rapidly test almost all genes potentially related to diseases and help reveal the pathogenesis beneath symptoms, which is meaningful in dealing with some rare or complex diseases. It provides researchers a great opportunity to conduct more comprehensive and profound genetic studies, with which we can better understand the relationship between genome and some complex diseases such as cancer and identify the effects of DNA variations
Published Version (Free)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.