Big Data is a valuable resource for delivering system-generated insights to external stakeholders. Managing such a large body of information requires automation, and this has spurred the development of data processing and machine learning tools. As in other fields of study and business, the ICT industry is developing and serving platforms and solutions that help professionals process their data and learn from it automatically. Large companies such as Google and Microsoft, as well as the Apache Software Foundation's Incubator, are the primary providers of these platforms. Apache Spark is an open-source platform for processing Big Data, including data that is noisy or contaminated. This unified framework provides a variety of methods for handling structured or unstructured text data, graph data, and real-time streaming data. Spark's MLlib library provides machine learning algorithms that can be customized. When data analytics is parallelized across a large cluster of machines, these methods require less memory, less processing time, and, to a large extent, a hand-tuned specialized architecture. The data sets are analyzed with machine learning methods including Linear Regression, Decision Tree, Random Forest, and Gradient Boosting Tree. The prediction model proposed in this research is used to analyze the data sets with these algorithms and to determine the best forecast value from the comparative study. One key goal of this study is to use the proposed model to make the most accurate forecast possible with machine learning methods. The proposed model uses the Apache Spark framework to perform a comparative analysis of existing approaches that implement supervised and unsupervised techniques with MapReduce. By comparing the time complexity of each method, the model identifies the best prediction.
This dissertation emphasizes the characteristics of data sets that are most useful for identifying the most effective prediction with machine learning algorithms.
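
The comparative step described above can be illustrated with a minimal sketch. It assumes that accuracy is measured as root-mean-square error (RMSE) on held-out data and that "time complexity" is approximated by wall-clock fitting time; neither choice is confirmed by the abstract. The two toy models, the synthetic data, and the names `fit_mean`, `fit_line`, and `compare` are hypothetical stand-ins. In a Spark setting, MLlib regressors (Linear Regression, Decision Tree, Random Forest, Gradient Boosting Tree) would take the place of the toy models.

```python
import time
from statistics import mean


def fit_mean(xs, ys):
    """Hypothetical baseline: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Hypothetical stand-in for a regressor: one-feature ordinary least squares."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def rmse(model, xs, ys):
    """Root-mean-square error of a fitted model on (xs, ys)."""
    return (sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def compare(candidates, train, test):
    """Fit every candidate, recording held-out RMSE and wall-clock fit time."""
    results = []
    for name, fit in candidates.items():
        start = time.perf_counter()
        model = fit(*train)
        elapsed = time.perf_counter() - start
        results.append(
            {"name": name, "rmse": rmse(model, *test), "fit_seconds": elapsed}
        )
    # Assumption: rank by error first and use fit time only as a tie-breaker.
    return sorted(results, key=lambda r: (r["rmse"], r["fit_seconds"]))


# Synthetic data generated from y = 2x + 1 (illustrative only, not from the study).
train = ([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
test = ([4.0, 5.0], [9.0, 11.0])

ranking = compare(
    {"mean_baseline": fit_mean, "linear_regression": fit_line}, train, test
)
best = ranking[0]
print(best["name"])
```

Ranking by error before time is only one possible design choice; a study that prioritizes running time could reverse the sort key or combine both measures into a single score.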