Elastic MapReduce Research Articles

Hadoop has become a popular framework for processing data-intensive applications in cloud environments. A core constituent of Hadoop is the scheduler, which is responsible for scheduling and monitoring jobs and tasks, and for rescheduling them in case of failures. Although fault-tolerance mechanisms have been proposed for Hadoop, its performance can still be significantly impacted by unforeseen events in the cloud environment. In this paper, we introduce a dynamic and failure-aware framework that can be integrated within the Hadoop scheduler and adjust scheduling decisions based on information collected about the cloud environment. Our framework relies on predictions made by machine learning algorithms and on scheduling policies generated by a Markovian Decision Process (MDP) to adjust its scheduling decisions on the fly. Instead of the fixed heartbeat-based failure detection commonly used in Hadoop to track active TaskTrackers (i.e., nodes that process the scheduled tasks), our proposed framework implements an adaptive algorithm that can dynamically detect TaskTracker failures. To deploy our proposed framework, we have built ATLAS+, an AdapTive Failure-Aware Scheduler for Hadoop. To assess the performance of ATLAS+, we conduct a large empirical study on a 100-node Hadoop cluster deployed on Amazon Elastic MapReduce (EMR), comparing the performance of ATLAS+ with that of three Hadoop schedulers (FIFO, Fair, and Capacity). Results show that ATLAS+ outperforms the FIFO, Fair, and Capacity schedulers. ATLAS+ can reduce the number of failed jobs by up to 43 percent and the number of failed tasks by up to 59 percent. On average, ATLAS+ reduced the total execution time of jobs by 10 minutes, which represents 40 percent of the job execution time, and by up to 3 minutes for tasks, which represents 47 percent of the task execution time. ATLAS+ also reduced CPU and memory usage by 22 and 20 percent, respectively.

Motivation: Bacterial metagenomics profiling for metagenomic whole genome sequencing (mWGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes, and do not scale well to larger reference genome collections. However, large reference genome collections can provide a more complete and accurate profile of the bacterial population in a metagenomics dataset. In this paper, we discuss a scalable, efficient and affordable approach to this problem, bringing big data solutions within the reach of laboratories with modest resources.

Results: We developed Flint, a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark’s built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43 552 bacterial genomes from Ensembl. Flint runs on Amazon’s Elastic MapReduce service, and is able to profile 1 million Illumina paired-end reads against over 40 K genomes on 64 machines in 67 s, an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the necessity of storing large quantities of intermediate alignments.

Availability and implementation: Flint is open source software, available under the MIT License (MIT). Source code is available at https://github.com/camilo-v/flint.

Supplementary information: Supplementary data are available at Bioinformatics online.

Articles published on Elastic MapReduce

A powerful heuristic method for generating efficient database systems

A Dynamic and Failure-Aware Task Scheduling Framework for Hadoop

Distributed Approach to Process Satellite Image Edge Detection on Hadoop Using Artificial Bee Colony

DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark

Large scale microbiome profiling in the cloud.

Device Contextual Content Publishing in Media & Publishing Industry using Big Data Analytics on AWS

Optimizing the Performance of Clouds Using Hash Codes in Apache Hadoop and Spark

Spark for Social Science

A course on big data analytics

Distributed simulation optimization and parameter exploration framework for the cloud

HCEm model and a comparative workload analysis of Hadoop cluster

Rail-dbGaP: analyzing dbGaP-protected data in the cloud with Amazon Elastic MapReduce.

Using adaptive resource allocation to implement an elastic MapReduce framework

Leveraging MapReduce to efficiently extract associations between biomedical concepts from large text data

Energy Cost Aware Scheduling of MapReduce Jobs across Geographically Distributed Nodes

Hadoop Based Data Intensive Computation on IaaS Cloud Platforms

Ontology Based Document Clustering Using MapReduce

Toward protecting control flow confidentiality in cloud-based computation

Processing Shotgun Proteomics Data on the Amazon Cloud with the Trans-Proteomic Pipeline
