Abstract

The storage and analysis of massive genetic variation datasets in variant call format (VCF) have become a major challenge with the rapid growth of genetic variation data in recent years. Traditional single-process tool kits are increasingly inefficient when analyzing such data. Emerging distributed storage technologies such as Apache Kudu offer an attractive solution, but a distributed storage tool kit for VCF datasets still needs to be developed. In this article, we present Variant-Kudu, an efficient genome tool kit for storing and analyzing massive genetic variation datasets. Under a new distributed storage scheme, the genetic variation data are segmented and stored in Kudu across multiple nodes. With this scheme, data can be randomly accessed at low latency and scanned efficiently. To reduce query execution time, we propose a distributed bitmap index strategy and design a parallel query method, which together expedite analyses of massive genetic variation data. Variant-Kudu is a scalable tool kit for analyzing massive genetic variation datasets, and our experiments demonstrate that it achieves high performance on a multinode cluster.
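
The general idea of a per-segment bitmap index combined with a parallel query can be illustrated with a minimal Python sketch. This is not Variant-Kudu's implementation: in-memory lists stand in for Kudu storage on separate nodes, a thread pool stands in for the cluster, and the function names (`build_bitmap_index`, `query_segment`, `parallel_query`) and the sample records are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_bitmap_index(rows, columns):
    """Map (column, value) to an int whose bit i is set when rows[i][column] == value."""
    index = defaultdict(int)
    for i, row in enumerate(rows):
        for col in columns:
            index[(col, row[col])] |= 1 << i
    return dict(index)


def query_segment(segment, predicates):
    """AND the bitmaps of all equality predicates, then return the matching rows."""
    rows, index = segment
    bits = (1 << len(rows)) - 1  # start with every row in the segment selected
    for col, value in predicates.items():
        bits &= index.get((col, value), 0)  # unknown value matches no rows
    return [row for i, row in enumerate(rows) if (bits >> i) & 1]


def parallel_query(segments, predicates, max_workers=4):
    """Run the same query on every segment concurrently and merge the results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(lambda seg: query_segment(seg, predicates), segments))
    return [row for part in parts for row in part]


# Hypothetical variant records, split into two segments (assumption: one per node).
variants = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "G"},
    {"chrom": "1", "pos": 250, "ref": "C", "alt": "T"},
    {"chrom": "2", "pos": 100, "ref": "A", "alt": "G"},
    {"chrom": "2", "pos": 900, "ref": "G", "alt": "A"},
]
segments = [
    (rows, build_bitmap_index(rows, ["chrom", "ref", "alt"]))
    for rows in (variants[:2], variants[2:])
]

hits = parallel_query(segments, {"ref": "A", "alt": "G"})
```

Each segment keeps its own index, so a query touches only small integer bitmaps until the final step, where the surviving bit positions are used to fetch rows. Combining predicates is a bitwise AND, which is why bitmap indexes suit low-cardinality fields such as chromosome or allele.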
