Spark Machine Learning Library Research Articles

Researchers have applied artificial intelligence, and machine learning in particular, in a variety of ways to turn diverse data sources into useful facts and insight, enabling better pattern identification. Running machine learning algorithms on large and complex data sets is computationally expensive, however, and requires hardware and logical resources such as storage, CPU, and memory. As the amount of data created daily reaches quintillions of bytes, a capable big data infrastructure becomes more and more relevant. The Apache Spark machine learning library (MLlib) is a popular platform for big data analysis. It includes several useful features for machine learning applications, including regression, classification, dimensionality reduction, clustering, and feature extraction. In this contribution, we consider Apache Spark MLlib as a computationally independent machine learning library that is open-source, distributed, scalable, and platform-independent. To analyze the platform's qualities, we evaluated several ML algorithms and compared Apache Spark MLlib against RapidMiner and Sklearn, two other big data and machine learning processing platforms. Four classification algorithms are compared across platforms: Logistic Classifier (LC), Decision Tree Classifier (DTC), Random Forest Classifier (RFC), and Gradient Boosted Tree Classifier (GBTC). In addition, we tested general regression methods, namely Linear Regressor (LR), Decision Tree Regressor (DTR), Random Forest Regressor (RFR), and Gradient Boosted Tree Regressor (GBTR), on the SUSY and HIGGS datasets. We also evaluated unsupervised learning methods, K-means and Gaussian Mixture Models, on the SUSY and HEPMASS datasets to determine the robustness of PySpark in comparison with the classification and regression models. We used the "SUSY", "HIGGS", "BANK", and "HEPMASS" datasets from the UCI data repository. We also discuss recent developments in research on Big Data machines and outline future research directions.

Summary: Many data sources produce large volumes of data, and the nature of Big Data requires new distributed processing approaches to extract the valuable information. Real‐time sentiment analysis is one of the most demanding research areas and requires powerful Big Data analytics tools such as Spark. Prior literature surveys have shown that, although there are many conventional sentiment analysis studies, only a few works realize sentiment analysis in real time. One major factor that affects the quality of real‐time sentiment analysis is the trustworthiness of the generated data. Put more plainly, a valuable research question is whether the owner of the account that generates a sentiment is genuine or not. Since data generated by fake personalities may decrease the accuracy of the outcome, an intelligent service that can assess the source of the data is a key part of the analysis. In this context, we include a fake account detection service in the proposed framework. Both the sentiment analysis and the fake account detection systems are trained and tested using the Naïve Bayes model from Apache Spark's machine learning library. The developed system consists of four integrated software components, i.e., (i) a machine learning and streaming service for sentiment prediction, (ii) a Twitter streaming service to retrieve tweets, (iii) a Twitter fake account detection service to assess the owner of the retrieved tweet, and (iv) a real‐time reporting and dashboard component to visualize the results of sentiment analysis. The sentiment classification performance of the system is 86.77% in offline mode and 80.93% in real‐time mode.
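
To make the classification step above more concrete, here is a minimal sketch of what a multinomial Naïve Bayes text classifier computes: class priors plus smoothed per-word likelihoods, combined in log space. It is written in plain Python for illustration and is not the authors' implementation, nor does it use Spark's API. The tokenizer, the function names (`tokenize`, `train`, `predict`), and the toy tweets are all hypothetical. The smoothing value of 1.0 is an assumption chosen to match the default additive smoothing of Spark's `NaiveBayes` estimator.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real pipelines do more)."""
    return text.lower().split()


def train(labelled_texts, smoothing=1.0):
    """Fit a multinomial Naive Bayes model with additive (Laplace) smoothing."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # per-class token counts
    vocab = set()
    for text, label in labelled_texts:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)

    total_docs = sum(class_counts.values())
    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for label, n_docs in class_counts.items():
        model["log_prior"][label] = math.log(n_docs / total_docs)
        # Smoothed denominator: class token total plus smoothing per vocab word.
        denom = sum(word_counts[label].values()) + smoothing * len(vocab)
        model["log_lik"][label] = {
            w: math.log((word_counts[label][w] + smoothing) / denom)
            for w in vocab
        }
    return model


def predict(model, text):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        score = log_prior
        for tok in tokenize(text):
            if tok in model["vocab"]:  # out-of-vocabulary tokens are ignored
                score += model["log_lik"][label][tok]
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, not taken from the paper.
TOY_TWEETS = [
    ("great phone love it", "positive"),
    ("love this happy day", "positive"),
    ("terrible service hate it", "negative"),
    ("hate this awful day", "negative"),
]

model = train(TOY_TWEETS)
print(predict(model, "love this great day"))  # positive
print(predict(model, "awful service hate"))   # negative
```

In a Spark deployment the same idea would typically run over distributed feature vectors rather than Python dictionaries. The sketch only shows the arithmetic that such a model performs per document.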

Articles published on Spark Machine Learning Library

Analyzing SQL payloads using logistic regression in a big data environment

Performance Evaluation of Data-driven Intelligent Algorithms for Big data Ecosystem.

Performance analysis of disease diagnostic system using IoMT and real‐time data analytics

Assessing naive Bayes and support vector machine performance in sentiment classification on a big data platform

Large scale data analysis using MLlib

Fast texture classification of denoised SAR image patches using GLCM on Spark

A spark‐based big data analysis framework for real‐time sentiment prediction on streaming data

A big data driven distributed density based hesitant fuzzy clustering using Apache spark with application to gene expression microarray

Short-term load forecasting with clustering–regression model in distributed cluster

Human Action Recognition Using Adaptive Local Motion Descriptor in Spark
