Understanding Distributed Semantic Analysis with Spark Data Frames

Richa Mathur, Dhanesh Kumar Solanki, Devesh K Bandil

doi:10.1201/9781003055129-2

Abstract

Big data analytics is one of the most active research trends today. Big data poses many challenges for processing and storing data when optimizing and implementing machine learning (ML) algorithms, because ML needs substantial computational power to process data for any kind of analytical output. To reduce this optimization problem, we use Spark and its data frames. The Spark engine also provides good support for designing the architecture of a big data solution, since it supports architectures built on Hadoop and MapReduce. Big data analytics needs real-time processing for whatever behavior we want to predict in text analytics. While working on semantic analysis, we faced many problems because documents from online sources contain many different spelling mistakes. To handle this problem, we create distributed schemas across different layers of distributed Spark data frames with the support of YARN, which supports the MapReduce processing we need to resolve the semantic effects of the mistakes in the sources. The first step we follow, however, is to create an ontology from these semantic characteristics. The base architecture of the ontology helps us understand how data behave under different semantic processes in whatever language the datasets use. When working with natural language processing (NLP), we also need to take the sentiment of each document into account and classify its information as positive, negative, or neutral for sentiment analysis. In this research, we mainly focus on an ontology architecture that processes different semantic information to increase sentiment accuracy by implementing different advanced ML techniques.
