Abstract
Background
The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, position the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via the use of map functions to process vectorized components, followed by a reduce of the aggregated intermediate results. However, for some data analysis procedures, such as sequence analysis, a more fundamental reformulation may be required.
Results
In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique that has been found to provide an "alignment-free" solution to sequence analysis and comparison: a solution that does not require dynamic programming and relies instead on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated in this report by calculating the length of the longest similar segment by inspecting only the Universal Sequence Map (USM) coordinates of two analogous units, with no resort to dynamic programming.
Conclusions
The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment, in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered with a browser-based application (webApp), highlighting the browser's emergence as an environment for high performance distributed computing.
Availability
The accompanying software library is publicly distributed as open source, under version control, at http://usm.github.com. It is also available as a webApp through Google Chrome's WebStore (http://chrome.google.com/webstore); search for "usm".
Highlights
The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, position the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development
A valuable development is the support of functional programming patterns that explicitly identify opportunities for parallelization through MapReduce [10]. This development is a major attraction of cloud computing services such as Amazon’s Elastic MapReduce and is turning high performance computing into a commodity [11]
CGR and Universal Sequence Map (USM): The fundamental iteration of the Chaos Game Representation (CGR) technique [15] is that of assigning a numerical coordinate to each symbol of a sequence, calculated as the previous position plus half the distance to the
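The CGR iteration named in the highlight above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the excerpt is cut off before it names the target of each half-step, so the sketch assumes the commonly used CGR convention in which each symbol of the DNA alphabet is assigned a corner of the unit square and the walk starts at the centre. The corner assignment and the names `CORNERS`, `cgr_step` and `cgr_coordinates` are hypothetical.

```python
# Assumed corner assignment for the DNA alphabet on the unit square.
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_step(position, symbol):
    """Return the previous position plus half the distance to the symbol's corner."""
    x, y = position
    cx, cy = CORNERS[symbol]
    return (x + (cx - x) / 2, y + (cy - y) / 2)


def cgr_coordinates(sequence, start=(0.5, 0.5)):
    """Return one CGR coordinate per symbol, walking the sequence left to right."""
    coordinates = []
    position = start  # assumed starting point: centre of the unit square
    for symbol in sequence:
        position = cgr_step(position, symbol)
        coordinates.append(position)
    return coordinates


if __name__ == "__main__":
    for symbol, point in zip("AGT", cgr_coordinates("AGT")):
        print(symbol, point)
```

Because each step halves the contribution of everything that came before, the coordinate of a symbol encodes its preceding context at progressively finer scales, which is the property an alignment-free comparison of coordinates would rely on.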
Summary
The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, position the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. The algorithms used to process and compare sequences largely rely on the dynamic programming solutions proposed by Smith-Waterman and Needleman-Wunsch in the 1970s and 1980s [2,3]. This is not to say that the implementation of alignment algorithms has not become more efficient; quite the opposite has taken place. A valuable development is the support of functional programming patterns that explicitly identify opportunities for parallelization through MapReduce [10]. The use of the MapReduce functional pattern underlies many of the leading genomic analysis packages, such as GATK [12] and CloudBurst [13], and is the key cloud computing abstraction for large-scale data management and analysis [11,14].