The life sciences are widely accepted as a foremost Big Data discipline; as such, they are a constant source of the most computationally challenging problems. To provide efficient solutions, the community is turning towards scalable approaches, such as the use of cloud resources in addition to existing local computational infrastructure. Although bioinformatics workflows are generally amenable to parallelization, the challenges involved are not only computationally intensive but also data intensive. In this paper we propose a data management methodology for achieving parallelism in bioinformatics workflows while simultaneously minimizing the file transfers caused by data interdependencies. We combine our methodology with a novel two-stage scheduling approach capable of performing load estimation and balancing both across and within heterogeneous distributed computational resources. In addition to an exhaustive set of experiments that validate the scalability and speed-up of our approach, we compare it against a state-of-the-art high-performance computing framework and showcase its time and cost advantages.