Abstract

Many statistical and machine learning (ML) techniques have been successfully applied to small-sized datasets during the past one and half decades. However, in today's world, different application domains, viz., healthcare, finance, bioinformatics, telecommunications, and meteorology, generate huge volumes of data on a daily basis. All these massive datasets have to be analyzed for discovering hidden insights. With the advent of big data analytics (BDA) paradigm, the data mining (DM) techniques were modified and scaled out to adapt to the distributed and parallel environment. This chapter reviewed 249 articles appeared between 2009 and 2019, which implemented different DM techniques in a parallel, distributed manner in the Apache Hadoop MapReduce framework or Apache Spark environment for solving various DM tasks. We present some critical analyses of these papers and bring out some interesting insights. We have found that methods like Apriori, support vector machine (SVM), random forest (RF), K-means and many variants of the previous along with many other approaches are made into parallel distributed environment and produced scalable and effective insights out of it. This review is concluded with a discussion of some open areas of research with future directions, which can be explored further by the researchers and practitioners alike.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.