Abstract

In this work, we explore approaches to large-scale data mining and analytics. We survey existing MapReduce and MapReduce-like frameworks for distributed data processing, as well as distributed file systems for distributed data storage. We study the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce software framework in detail. We analyse the benefits of the newer version of the Hadoop framework, YARN, which improves scalability by separating cluster resource management from the MapReduce framework and flexibly supports forms of distributed data processing beyond MapReduce's batch-mode processing. We also review various MapReduce-based implementations of data mining algorithms to build a comprehensive picture of how such algorithms are developed, and we survey tools that provide scalable, MapReduce-based data mining algorithms. Apache Mahout was the only tool we found built specifically on Hadoop MapReduce, but its developers have decided to stop using Hadoop MapReduce and to adopt Apache Spark as the underlying execution engine. WEKA also offers a small subset of data mining algorithms implemented with MapReduce, but these are not actively maintained or supported by its developers. We subsequently found that Apache Spark, in addition to providing an optimised, faster execution engine for distributed processing, ships with an accompanying machine learning library, MLlib. Spark is claimed to be much faster than Hadoop MapReduce because it exploits in-memory computation, which is particularly beneficial for the iterative workloads typical of data mining. Spark is designed to run on a variety of cluster managers, YARN among them, and to process data stored in Hadoop. We chose a specific data mining task: classification and regression based on decision tree learning. We stored labelled training data for these predictive mining tasks in HDFS, set up a YARN cluster, and ran Spark MLlib applications on it; these applications use YARN for cluster resource management and Spark's core services for distributed execution. We performed several experiments to measure the performance gains, speed-up, and scale-up of the decision tree learning implementations in Spark's MLlib. The results were much better than expected: we observed super-linear speed-up as the number of nodes increased, excellent scale-up, and a significant reduction in the run-time for training decision tree models. This demonstrates that Spark MLlib's decision tree learning algorithms for classification and regression analysis are highly scalable.
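To make the experimental setup concrete, the following is a minimal sketch of the kind of Spark MLlib application described above: it loads labelled training data from HDFS and trains a decision tree classifier using the RDD-based MLlib API. The application name, HDFS path, and parameter values (number of classes, impurity, depth, bins) are illustrative assumptions, not the exact configuration used in the experiments.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

object DecisionTreeExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DecisionTreeClassification")
    val sc = new SparkContext(conf)

    // Load labelled training data (LibSVM format) from HDFS; the path is a placeholder.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///user/example/training-data.libsvm")
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

    // Illustrative training parameters; all features are treated as continuous.
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val impurity = "gini"
    val maxDepth = 5
    val maxBins = 32

    // Train the decision tree classifier on the distributed training set.
    val model = DecisionTree.trainClassifier(training, numClasses,
      categoricalFeaturesInfo, impurity, maxDepth, maxBins)

    // Evaluate on the held-out split.
    val labelAndPreds = test.map(p => (p.label, model.predict(p.features)))
    val testErr = labelAndPreds.filter { case (l, p) => l != p }.count().toDouble / test.count()
    println(s"Test error = $testErr")

    sc.stop()
  }
}
```

Such an application would typically be packaged as a JAR and submitted to the YARN cluster with spark-submit using YARN as the master (for example, --master yarn with cluster deploy mode), so that YARN manages the cluster resources while Spark handles the distributed execution.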
