Abstract

In the era of Big Data in Life Sciences, the efficient processing and analysis of vast amounts of sequence data is becoming an ever more daunting challenge. Among such analyses, sequence alignment is one of the most commonly used procedures, as it provides useful insights into the functionality and relationships of the involved entities. At the same time, however, it is one of the most common computational bottlenecks in bioinformatics workflows, especially when combined with the construction of families and phylogenetic profiles. We have designed and implemented a time-efficient, distributed, modular application for sequence alignment, phylogenetic profiling and clustering of protein sequences, utilizing the European Grid Infrastructure.

Specifically, the application comprises three main components: (a) BLAST alignment, (b) construction of phylogenetic profiles based on the produced alignment scores, and (c) clustering of entities using the MCL algorithm. These modules were selected because they are common to the vast majority of bioinformatics workflows. The modules can be combined independently, ultimately providing 4 different modes of operation. We have evaluated the application through several different scenarios, ranging from targeted investigations of enzymes participating in selected pathways against a custom database to produce functional groups, to large-scale comparisons at the pangenome level. In all cases, the optimal utilization of the Grid with regard to the respective modules allowed us to achieve a significant speedup, on the order of 14x with respect to traditional approaches.

Background

Grid Computing is an established method of high performance computing that is mostly utilized by embarrassingly parallel processes.

Fig. 1: An example of a Grid Architecture

Modes Of Operation

The 4 modes of operation are presented below:
1. MCL clusters of the protein query and database sequences. The clustering criterion is the BLAST output (identity or e-value), based on the preference of the user.
2. Phylogenetic profiles of each query sequence, where the genomes under consideration are the ones whose proteins form the database.
3. MCL clusters of the protein query sequences and database genomes, and phylogenetic profiles. The MCL clustering criterion is the phylogenetic profiles.
4. A combination of the output produced in modes 1 and 3.

There is also a fifth mode that generates the same output as the fourth one, with the only difference being that the same file is used both as a database and as a query. This is the case of an all-vs-all sequence comparison, widely used when performing a pangenome analysis.

Program Flow

Our proposed framework distributes both processes and data across the provided resources. The distribution is performed automatically, based on the selected mode as well as the data under study.

The input comprises the following files:
1. Two files containing the query protein sequences and the database protein sequences to be aligned, in FASTA format, which is a text-based format for representing nucleotide or peptide sequences, and

Fig. 2: General flow chart of the process

Results

We have implemented the proposed framework using a number of scripts suitable for a Unix environment of a Grid Infrastructure. We present the scaling of the average job execution time and in-queue time for different numbers of submitted jobs. The in-queue time is computed as the time elapsed from the submission of a job until its execution was initiated.

Fig. 4: Average run and in-queue time per job, as a function of the number of jobs submitted

Furthermore, we have evaluated the framework on a real-world scenario, i.e. the analysis of a plant pangenome.

Fig. 5: Output of the all-vs-all comparison of the ~2M sequences of the plant pangenome. Horizontal axis: 98 plant species. Vertical axis: 1,979,749 phylogenetic profiles. Overall execution time with the proposed framework: ~300 h; avg. phylogenetic profile time: 16.21 h; avg. BLAST time: 28.05 h; avg. time-in-queue: 66.1 h; MCL: 28 h

References

Duarte AMS, Psomopoulos FE et al. (2015): Future opportunities and trends for e-infrastructures and life sciences: going beyond the grid to enable life science data analysis. Front. Genet. 6:197
Enright AJ, Van Dongen S, Ouzounis CA (2002): An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30(7): 1575–1584
Psomopoulos FE et al. (2014): The Chlamydiales Pangenome Revisited: Structural Stability and Functional Coherence. Genes 3(2): 291-319

Useful resources

Source code: https://github.com/BioDAG/BPM
European Grid Infrastructure: http://www.egi.eu/
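
As an illustration of the phylogenetic profiling step (component (b) of the application), the following is a minimal sketch of how a binary presence/absence profile per query sequence could be derived from BLAST scores. It assumes tab-separated BLAST output in the standard 12-column tabular layout (query id in column 1, subject id in column 2, e-value in column 11), and it assumes that every database sequence identifier starts with its genome label followed by `|`. The function name `build_profiles`, the identifier convention and the e-value cut-off are hypothetical choices made for this example and are not taken from the BPM source code.

```python
from collections import defaultdict


def build_profiles(blast_lines, genomes, max_evalue=1e-5):
    """Build binary phylogenetic profiles from BLAST tabular lines.

    Returns {query_id: [0 or 1 per genome]}, where 1 means the query has
    at least one hit in that genome with e-value <= max_evalue.
    Assumption (hypothetical): subject ids look like "genomeLabel|proteinId".
    """
    column = {genome: i for i, genome in enumerate(genomes)}
    profiles = defaultdict(lambda: [0] * len(genomes))

    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        evalue = float(fields[10])

        profile = profiles[query_id]  # creates an all-zero row on first sight
        genome = subject_id.split("|", 1)[0]
        if evalue <= max_evalue and genome in column:
            profile[column[genome]] = 1

    return dict(profiles)


# Small made-up example: three hits against three genomes.
hits = [
    "q1\tgenomeA|p1\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200",
    "q1\tgenomeC|p9\t40.0\t80\t48\t0\t1\t80\t1\t80\t1e-8\t60",
    "q2\tgenomeB|p4\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
]
profiles = build_profiles(hits, ["genomeA", "genomeB", "genomeC"])
# profiles["q1"] is [1, 0, 1]; profiles["q2"] is [0, 0, 0] (its only hit fails the cut-off)
```

Queries without any BLAST hit do not appear in the tabular output at all, so a complete profile matrix would need all-zero rows added for them from the query FASTA file. Each BLAST result file can be processed on its own, which is in line with the embarrassingly parallel character of the workload described above.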

