Abstract

Background: The Ensembl project produces updates to its comparative genomics resources with each of its several releases per year. During each release cycle approximately two weeks are allocated to generate all the genomic alignments and the protein homology predictions. The number of calculations required for this task grows approximately quadratically with the number of species. We currently support 50 species in Ensembl and we expect the number to continue to grow in the future.

Results: We present eHive, a new fault tolerant distributed processing system initially designed to support comparative genomic analysis, based on blackboard systems, network distributed autonomous agents, dataflow graphs and block-branch diagrams. In the eHive system a MySQL database serves as the central blackboard and the autonomous agent, a Perl script, queries the system and runs jobs as required. The system allows us to define dataflow and branching rules to suit all our production pipelines. We describe the implementation of three pipelines: (1) pairwise whole genome alignments, (2) multiple whole genome alignments and (3) gene trees with protein homology inference. Finally, we show the efficiency of the system in real case scenarios.

Conclusions: eHive allows us to produce computationally demanding results in a reliable and efficient way with minimal supervision and high throughput. Further documentation is available at: http://www.ensembl.org/info/docs/eHive/.
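The blackboard arrangement described in the abstract, in which a central database holds the jobs and autonomous agents claim and run them, can be illustrated with a minimal sketch. This is not eHive's code or schema: eHive uses MySQL and a Perl agent, whereas the sketch uses Python with an in-memory SQLite database as a stand-in, and the `job` table, the status values, the analysis names and the `dataflow` mapping are all hypothetical names chosen for illustration.

```python
import sqlite3


def create_blackboard():
    """Create an in-memory job table standing in for the central blackboard.

    Assumption: this schema is invented for the sketch, not eHive's schema.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE job ("
        " id INTEGER PRIMARY KEY,"
        " analysis TEXT NOT NULL,"
        " input TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'READY',"
        " worker TEXT)"
    )
    return db


def add_job(db, analysis, input_id):
    """Post a new job on the blackboard."""
    db.execute(
        "INSERT INTO job (analysis, input) VALUES (?, ?)", (analysis, input_id)
    )


def claim_job(db, worker):
    """Claim one READY job for this worker, or return None if none is left."""
    row = db.execute(
        "SELECT id, analysis, input FROM job"
        " WHERE status = 'READY' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # Re-checking the status in the UPDATE guards against another agent
    # having claimed the same job between the SELECT and the UPDATE.
    cur = db.execute(
        "UPDATE job SET status = 'CLAIMED', worker = ?"
        " WHERE id = ? AND status = 'READY'",
        (worker, row[0]),
    )
    return row if cur.rowcount == 1 else None


def run_worker(db, worker, runners, dataflow):
    """Claim and run jobs until none can be claimed; return the number completed.

    `runners` maps an analysis name to a function; `dataflow` maps an analysis
    name to the analyses that should receive its output as new jobs.
    """
    done = 0
    while True:
        job = claim_job(db, worker)
        if job is None:
            break
        job_id, analysis, input_id = job
        try:
            output = runners[analysis](input_id)
        except Exception:
            # A failed job is recorded on the blackboard; the agent carries on.
            db.execute("UPDATE job SET status = 'FAILED' WHERE id = ?", (job_id,))
            continue
        db.execute("UPDATE job SET status = 'DONE' WHERE id = ?", (job_id,))
        for next_analysis in dataflow.get(analysis, []):
            add_job(db, next_analysis, output)
        done += 1
    return done
```

In this sketch the agents never talk to each other. All coordination goes through the shared table, which is the defining property of a blackboard design, and a dataflow rule is simply a mapping from a finished analysis to the analyses that should be queued next.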

Highlights

  • The Ensembl project produces updates to its comparative genomics resources with each of its several releases per year

  • Many of these systems have a latency of several seconds between the job submission and its execution and most are designed around the idea that jobs will run for an hour or more. They are not designed for handling 100 million jobs that run for only a few seconds each. To manage this increased job queuing overhead, applications with large numbers of short jobs often require another system on top of the job scheduler to "batch" jobs so that they can match the parameters of the job scheduler

  • Here we describe the eHive system for large-scale genomic analysis
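The cost of scheduler latency for very short jobs, and the effect of batching them, can be made concrete with a small calculation. The sketch below is illustrative only: the 5-second latency is an assumed value standing in for "several seconds", the batch size of 1000 is arbitrary, and the overhead model assumes the latency is paid once per submission, one submission after another, which is a simplification of how a real scheduler behaves.

```python
from itertools import islice


def batched(jobs, batch_size):
    """Yield successive lists of at most `batch_size` jobs."""
    it = iter(jobs)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def scheduling_overhead_seconds(n_jobs, latency_s, batch_size=1):
    """Total submission latency if each batch pays `latency_s` once.

    Assumption: latencies add up sequentially; real schedulers overlap them.
    """
    n_submissions = -(-n_jobs // batch_size)  # ceiling division on integers
    return n_submissions * latency_s


if __name__ == "__main__":
    n_jobs = 100_000_000   # figure quoted in the text
    latency_s = 5          # assumed value for "several seconds"
    unbatched = scheduling_overhead_seconds(n_jobs, latency_s)
    grouped = scheduling_overhead_seconds(n_jobs, latency_s, batch_size=1000)
    print(f"one submission per job:   {unbatched / 86400:.0f} days of latency")
    print(f"1000 jobs per submission: {grouped / 3600:.0f} hours of latency")
```

Under these assumptions, grouping jobs reduces the number of submissions by the batch size, which is why a layer that batches short jobs is often placed on top of a general-purpose scheduler.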

Introduction

The Ensembl project produces updates to its comparative genomics resources with each of its several releases per year. The Ensembl project provides an integrated system for the annotation of chordate genomes and the management of genome information [1]. Data updates are provided for recently sequenced species, for those species with new assemblies and when additional information is available. The data is provided through the Ensembl Genome Browser (http://www.ensembl.org), a Perl API, via direct querying of the underlying databases or via BioMart, a data-mining tool [2]. The same public Perl API is used both by the web server to fetch data from the database and by project members themselves to access data, run analyses and store the results of those analyses.
