Statika: managing cloud resources, bioinformatics tools and data

Alexey Alekhin, Pablo Pareja-Tobes, Eduardo Pareja, Raquel Tobes, Marina Manrique, Evdokim Kovach, Eduardo Pareja-Tobes

doi:10.5281/zenodo.35101

Abstract

Next Generation Sequencing (NGS) has brought a revolution to the bioinformatics landscape, reshaping fields such as genomics and transcriptomics by offering vast amounts of data about previously inaccessible domains in a cheap and scalable way. Biological data analysis thus demands, more than ever, high performance computing architectures; in particular, Cloud Computing, a comparable breakthrough in the IT world, holds promise as the foundation on which a solution could be built (as already demonstrated by pioneering efforts such as Galaxy or CloudBioLinux). It provides a well-suited framework for high throughput data analysis: the ability to deploy architectures with as much computing capacity as needed, to scale horizontally, to scale down in real time as computing needs change, and the pay-as-you-go model all make for a strong case. However, fast, reproducible, and cost-effective data analysis in the cloud at such scale remains elusive. One fundamental prerequisite for achieving this is the ability to manage both the tools and the data to be used in a robust, reproducible, and automated way. High throughput analysis, where a lot of resources are to be used and paid for, needs a robust configuration system to rely on. In the cloud computing world, due to its on-demand nature, automated resource configuration is a critical factor. This is even more so in bioinformatics analysis, where an intricate and unstable chain of dependencies often underlies tools and data; knowing beforehand that all the resources to be used are properly configured is invaluable. Statika (http://ohnosequences.com/statika) aims to be a basic tool for the declaration and deployment of composable, versioned and reproducible cloud infrastructures for the bioinformatics space.
Data, tools and infrastructure are treated on an equal footing, and an expressive domain-specific language allows the user to express complex dependency relationships, check for possible version conflicts and automatically choose a safe resource creation order. By making use of advanced features of the Scala programming language, such as dependent types and type-level computations, a great deal of structure can be expressed abstractly and checked at compile time, before any cost is incurred. A strong versioning system that covers both data and tools makes reproducibility not only possible but actually enforced. Statika has been put to work on scenarios as different as Nispero, a cloud-based system for scaling inherently parallel computations in the bioinformatics domain, and versioned, modular, automated deployments of Bio4j, a graph database integrating all data from key resources in the bioinformatics data space, including UniProt, Gene Ontology, the NCBI Taxonomy and UniRef. We use it internally for the integration and automated deployment of all sorts of bioinformatics tools and data. Statika is open source, available under the AGPLv3 license. This project is funded in part by the ITN FP7 project INTERCROSSING (Grant 289974).

Proceedings IWBBIO 2014. Granada 7-9 April, 2014, pp. 1412-1413.
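
The ideas of declaring dependencies, checking for version conflicts and choosing a safe creation order can be illustrated with a minimal sketch. It is written in Python rather than in Statika's Scala DSL, so it performs its checks at run time, not at compile time, and it does not reproduce Statika's actual API. The names `Bundle`, `collect`, `version_conflicts` and `creation_order`, as well as the example resource names and version strings, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Bundle:
    """Hypothetical versioned unit: a tool, a dataset or a piece of infrastructure."""
    name: str
    version: str
    deps: tuple = ()  # other Bundle instances this one depends on


def collect(roots):
    """Walk the dependency graph from the roots and return every bundle reached."""
    seen = set()
    stack = list(roots)
    while stack:
        bundle = stack.pop()
        if bundle not in seen:
            seen.add(bundle)
            stack.extend(bundle.deps)
    return seen


def version_conflicts(bundles):
    """Map each name that occurs with more than one version to its sorted versions."""
    versions = {}
    for bundle in bundles:
        versions.setdefault(bundle.name, set()).add(bundle.version)
    return {name: sorted(vs) for name, vs in versions.items() if len(vs) > 1}


def creation_order(roots):
    """Return all needed bundles, each one after every bundle it depends on."""
    bundles = collect(roots)
    conflicts = version_conflicts(bundles)
    if conflicts:
        raise ValueError(f"version conflicts: {conflicts}")
    # TopologicalSorter expects a mapping from node to its predecessors.
    graph = {bundle: set(bundle.deps) for bundle in bundles}
    return list(TopologicalSorter(graph).static_order())


# Illustrative resources only; names and versions are made up.
jdk = Bundle("jdk", "7")
uniprot = Bundle("uniprot-data", "2014_03")
bio4j = Bundle("bio4j", "0.1", deps=(jdk, uniprot))

for bundle in creation_order([bio4j]):
    print(bundle.name, bundle.version)
```

Requesting two bundles that pull in different versions of the same dependency makes `creation_order` raise a `ValueError` before anything is created. This is the run-time analogue of the guarantee that the abstract describes as being checked at compile time.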

Full Text