The application of Hadoop in structural bioinformatics.

Jamie J Alnasir, Hugh P Shanahan

doi:10.1093/bib/bby106

Abstract

This paper reviews the use of the Hadoop platform in structural bioinformatics applications. For structural bioinformatics, Hadoop provides a new framework for analysing large fractions of the Protein Data Bank, which is key for high-throughput studies of, for example, protein-ligand docking, clustering of protein-ligand complexes and structural alignment. Specifically, we review a number of Hadoop implementations of high-throughput analyses reported in the literature, together with their scalability. We find that these deployments for the most part call known executables from MapReduce rather than rewriting the algorithms. Scalability varies in comparison with other batch schedulers, particularly as direct comparisons on the same platform are generally not available. Direct comparisons of Hadoop with batch schedulers are absent from the literature, but we note some evidence that Message Passing Interface (MPI) implementations scale better than Hadoop. A significant barrier to the use of the Hadoop ecosystem is the difficulty of its interface and of configuring a resource to run Hadoop. This will improve over time as interfaces to Hadoop (e.g. Spark) improve, usage of cloud platforms (e.g. Azure and Amazon Web Services (AWS)) increases, and standardised approaches such as workflow languages (i.e. Workflow Definition Language, Common Workflow Language and Nextflow) are taken up.

Highlights

The Apache Hadoop project [73] is a software ecosystem, i.e. a collection of interrelated, interacting projects forming a common technological platform [48] for analysing large data sets. Hadoop presents three potential advantages for the analysis of large biological data sets.
The most commonly used methods have been deployed as Java packages for the Hadoop platform. These include PSIPRED for protein structure prediction [44], GenTHREADER for protein fold recognition using genomic sequences [28], BioSerf (a homology modelling protocol), MEMSAT for improving the accuracy of transmembrane protein topology prediction [29], DomPred for protein domain boundary prediction [10], MetSite for predicting clusters of metal-binding residues [71], and FFPred, which uses a machine learning approach to predict protein function [40]. The purpose of this review is to give an insight into the impact that Hadoop and the MapReduce formalism have had in Structural Bioinformatics.
As noted previously, the adoption of Hadoop is not a trivial step for a Structural Bioinformatics lab that already has extensive experience in using traditional batch schedulers running on a local cluster.

Summary

Introduction

The Apache Hadoop project [73] is a software ecosystem, i.e. a collection of interrelated, interacting projects forming a common technological platform [48] for analysing large data sets. Hadoop presents barriers to its adoption within the community for bioinformatics and the analysis of structural data. Implementing Hadoop on a local cluster is not trivial and requires a significant level of expertise from the relevant systems administrator. As we note, this latter difficulty is obviated on cloud platforms such as Azure and AWS [66]. In the first instance, we give a brief overview of the Hadoop system as well as a description of batch schedulers and MPI.

Hadoop and MapReduce
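
The abstract observes that most deployments call known executables from MapReduce rather than rewriting the algorithms. A minimal sketch of that pattern follows. It is not taken from the paper: the input layout (a tab-separated group and structure identifier), the "keep the lowest score" reducer and the stand-in tool are all assumptions made for illustration. The map, shuffle and reduce steps are simulated in a single local process so the example runs without a Hadoop installation.

```python
import subprocess
import sys
from collections import defaultdict

# Hypothetical stand-in for an existing executable (for example a docking
# or alignment binary). It simply prints the length of its argument, so
# the sketch runs anywhere a Python interpreter is available.
STAND_IN_TOOL = [sys.executable, "-c", "import sys; print(len(sys.argv[1]))"]


def mapper(record):
    """Map step: call the external tool once per input record.

    `record` is a "group<TAB>structure_id" line, a layout assumed for
    this sketch only.
    """
    group, structure_id = record.rstrip("\n").split("\t")
    result = subprocess.run(
        STAND_IN_TOOL + [structure_id],
        capture_output=True,
        text=True,
        check=True,  # raise if the wrapped tool exits with an error
    )
    yield group, float(result.stdout.strip())


def reducer(group, scores):
    """Reduce step: keep the lowest score seen for each group."""
    return group, min(scores)


def run_local(records):
    """Simulate map, shuffle and reduce in one process, without Hadoop."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)  # shuffle: collect values by key
    return dict(reducer(key, values) for key, values in sorted(shuffled.items()))


if __name__ == "__main__":
    demo = ["ligandA\t1abc", "ligandA\t2wxyz", "ligandB\t3de"]
    print(run_local(demo))
```

The point of the design is that the wrapped tool stays untouched: only the thin mapper and reducer know about keys and values. On a real cluster, logic of this kind would typically be split into separate mapper and reducer scripts that read standard input and write standard output, for instance under Hadoop Streaming.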

Batch schedulers

Applications of Hadoop in Bioinformatics

Applications in Structural Bioinformatics

Molecular docking

Docking of protein-ligand complexes on Hadoop

Clustering of protein-ligand complexes

Structural Alignment

Other Structural Bioinformatics applications using Hadoop

Findings

Conclusions

Journal: Briefings in Bioinformatics
Publication Date: Nov 20, 2018
Citations: 8
License type: CC BY
