Abstract

To enable efficient computation on rapidly growing big data, a variety of high-performance computing (HPC) platforms, such as traditional multi-processor systems, Hadoop and cloud computing systems, have been developed. On the analytics side of big data, several innovative machine learning methods have been developed to enable the extraction of accurate and actionable knowledge from large datasets. In particular, heterogeneous ensemble algorithms, which are designed to aggregate an unrestricted variety and number of analytical models, have performed well on a variety of prediction problems. However, the performance of these algorithms in terms of computational metrics, such as run time, disk space consumption and memory usage, on these HPC platforms has not yet been systematically examined. Here, we address this gap in knowledge by implementing these algorithms and systematically assessing their computational performance on traditional HPC and Hadoop platforms. Our results show that these implementations used resources, especially disk space and memory, in a manner consistent with the respective designs of the platforms. Furthermore, due to the iterative nature of heterogeneous ensemble computations, the traditional HPC system executed them faster than Hadoop, as its in-memory design is better suited to such computations than Hadoop's disk-based one. Overall, our study sheds new light on the computational performance of ensemble algorithms and software frameworks on two prominent HPC platforms, and offers a systematic methodology for conducting similar assessments of other data analytics methods. Basic source code of our heterogeneous ensemble implementations, as well as of the HPC performance assessments, is available at https://github.com/GauravPandeyLab/HPC-Ensemble.
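To make the central concept concrete, the following is a minimal, illustrative sketch of a heterogeneous ensemble in Python, using scikit-learn's stacking framework to aggregate base models drawn from different algorithm families. This is a generic example on synthetic data, not the paper's actual implementation, which is available in the repository linked above.

# Illustrative sketch of a heterogeneous ensemble (stacking): base learners
# from different algorithm families are combined by a meta-learner.
# Generic scikit-learn example, not the authors' implementation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic binary classification data standing in for a large dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Heterogeneous base models: distinct algorithm families, not multiple
# instances of a single learner
base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
    ("nb", GaussianNB()),
]

# Aggregate the base predictions with a logistic-regression meta-learner
ensemble = StackingClassifier(estimators=base_models,
                              final_estimator=LogisticRegression())
ensemble.fit(X_train, y_train)
print("Held-out accuracy:", ensemble.score(X_test, y_test))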
