Abstract

As the volume of data that commercial, research, and other systems must process keeps growing, there is strong interest in techniques for managing this volume efficiently and retrieving useful results. A preferred approach is to process these large volumes of data at big-data scale on commodity, commercially available hardware arranged as a distributed environment. The goal of this approach is to execute tasks in parallel over different chunks of the data, achieving both a speedup in execution time and better scalability with respect to input size when compared with conventional serial implementations.

One programming model on which such systems are based is MapReduce, which composes sequences of two simple function types applied to the data and can easily be adapted to the needs of a particular application, whether on local infrastructure or through remote resources such as cloud computing services. The open-source Apache Hadoop platform is one of the best known for running MapReduce jobs, and its popularity has spurred the implementation of useful extensions that either build entirely on its structure or go beyond it to improve the performance of specific applications. One such extension is Apache Spark, which prioritizes in-memory processing over disk storage and supports a rich set of implementations through its libraries.

One task type that both platforms support is text classification, and more specifically text sentiment analysis, in which machine learning techniques build a model that attempts to determine the sentiment expressed in each text document. The main purpose of this thesis is to investigate and develop applications that implement document classification algorithms (such as Naive Bayes and Support Vector Machines) on the Hadoop and Spark platforms, to test modified versions of them, to examine their efficiency and parallel execution in an experimental environment, and finally to evaluate them against indicative usage scenarios.
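
To make the MapReduce model described above concrete, the following is a minimal sketch of its two function types applied to a word-count task, written against Spark's RDD API. PySpark, the local master setting, and the input path "input.txt" are illustrative assumptions rather than the thesis's actual setup.

```python
# Minimal sketch of the MapReduce model using PySpark's RDD API.
# "input.txt" and the local master are illustrative placeholders.
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

counts = (
    sc.textFile("input.txt")                 # read the input as lines
      .flatMap(lambda line: line.split())    # map phase: emit one record per word
      .map(lambda word: (word, 1))           # map phase: key-value pairs (word, 1)
      .reduceByKey(lambda a, b: a + b)       # reduce phase: sum the counts per key
)

for word, count in counts.collect():
    print(word, count)

sc.stop()
```

On plain Hadoop the same two phases would instead be expressed as Mapper and Reducer classes (or as streaming scripts), with the framework handling the shuffle of intermediate key-value pairs between them.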
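
For the classification side, the sketch below shows what a sentiment model of the kind the abstract mentions might look like in Spark's MLlib: a Naive Bayes pipeline over labeled text documents. The tiny inline dataset, the column names, and the TF-IDF feature choice are hypothetical placeholders, not the thesis's actual configuration.

```python
# Illustrative Naive Bayes sentiment classifier with Spark MLlib.
# The inline training examples and feature settings are placeholders only.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import NaiveBayes

spark = SparkSession.builder.appName("SentimentNB").getOrCreate()

# label 1.0 = positive, 0.0 = negative (hypothetical examples)
train = spark.createDataFrame(
    [("great product, works perfectly", 1.0),
     ("terrible quality, waste of money", 0.0),
     ("really happy with this purchase", 1.0),
     ("broke after one day, very disappointed", 0.0)],
    ["text", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),        # split text into tokens
    HashingTF(inputCol="words", outputCol="tf"),          # term frequencies
    IDF(inputCol="tf", outputCol="features"),             # weight by inverse document frequency
    NaiveBayes(featuresCol="features", labelCol="label"), # multinomial Naive Bayes by default
])

model = pipeline.fit(train)
model.transform(train).select("text", "prediction").show(truncate=False)

spark.stop()
```

For the Support Vector Machines variant the abstract also names, MLlib's LinearSVC could be swapped in for the NaiveBayes stage with the same feature pipeline.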
