Abstract

Range-join is an operation for finding overlaps in interval-form genomic data. It is widely used in genome analysis processes such as annotation, filtering and comparison of variants in whole-genome and exome analysis pipelines. The quadratic complexity of current algorithms, combined with the sheer volume of data, makes efficient range-join design increasingly challenging. Existing tools are limited in algorithmic efficiency, parallelism, scalability and memory consumption. This paper proposes BIndex, a novel bin-based indexing algorithm, and its distributed implementation, to attain high-throughput range-join processing. BIndex features near-constant search complexity, and its inherently parallel data structure facilitates exploitation of parallel computing architectures. Balanced partitioning of the dataset further enables scalability on distributed frameworks. The implementation on Message Passing Interface (MPI) shows up to 933.5x speedup in comparison to state-of-the-art tools. The parallel nature of BIndex further enables GPU-based acceleration, with a 3.72x speedup over CPU implementations. The add-in modules for Apache Spark provide up to 4.65x speedup over the previously best available tool. BIndex supports a wide variety of input and output formats prevalent in the bioinformatics community, and the algorithm is easily extendable to streaming data in recent Big Data solutions. Furthermore, the index data structure is memory-efficient and consumes up to two orders of magnitude less RAM, with no adverse effect on speedup.
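
To make the idea of bin-based indexing for range-join concrete, the following is a minimal Python sketch of a generic fixed-width binning scheme. It is not the BIndex algorithm itself, whose details the abstract does not give. The names `BIN_SIZE`, `build_bin_index` and `range_join`, the bin width of 1000, and the half-open `[start, end)` coordinate convention are all assumptions made for illustration.

```python
from collections import defaultdict

BIN_SIZE = 1000  # hypothetical bin width in base pairs (an assumption)


def build_bin_index(intervals, bin_size=BIN_SIZE):
    """Map each bin number to the ids of the intervals that touch that bin.

    Intervals are (start, end) tuples, assumed half-open [start, end)
    and non-empty (start < end).
    """
    index = defaultdict(list)
    for idx, (start, end) in enumerate(intervals):
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size
        for b in range(first_bin, last_bin + 1):
            index[b].append(idx)
    return index


def range_join(queries, intervals, bin_size=BIN_SIZE):
    """Return (query_id, interval_id) pairs for every overlapping pair."""
    index = build_bin_index(intervals, bin_size)
    pairs = []
    for q_idx, (q_start, q_end) in enumerate(queries):
        seen = set()  # an interval spanning several bins is reported once
        for b in range(q_start // bin_size, (q_end - 1) // bin_size + 1):
            for r_idx in index.get(b, ()):
                if r_idx in seen:
                    continue
                seen.add(r_idx)
                r_start, r_end = intervals[r_idx]
                # Sharing a bin is only a candidate; confirm true overlap.
                if r_start < q_end and q_start < r_end:
                    pairs.append((q_idx, r_idx))
    return pairs


intervals = [(100, 250), (900, 1100), (5000, 5200)]
queries = [(200, 950), (1050, 1060), (3000, 4000)]
result = range_join(queries, intervals)
```

In this sketch a query inspects only the bins it spans, so lookup cost depends on bin occupancy rather than on the total number of intervals. Because each bin can be processed independently, a structure of this kind lends itself to parallel and partitioned execution, which is the property the abstract highlights.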
