An Approach Toward Design and Implementation of Distributed Framework for Astronomical Big Data Processing

R Monisha, Sourav Dey, K S Sri Lakshmi, Rajat U Davangeri, Snigdha Sen

doi:10.1007/978-981-19-0901-6_26

Abstract

Advances in modern technology have led to a sharp increase in data generation across all sectors. Observational astronomy has embraced modern tools and now generates large volumes of data, and analyzing these data and extracting useful patterns from them is the need of the hour. In this paper, we implement several machine learning algorithms using Apache Spark to process this massive amount of data. The case study we consider, drawn from cosmology, is photometric redshift estimation, a dominant research area in astronomy. High-end telescopic cameras produce a large amount of astronomical data that needs to be analyzed efficiently and quickly. In this work, we implement Artificial Neural Network (ANN), Random Forest, Linear Regression, and Decision Tree algorithms on Apache Spark to predict the redshift of galaxies and quasars. The focus of our study is to explore and compare the execution time of these four machine learning algorithms and to provide a detailed study of their performance in a distributed environment as well as on a standalone system. The dataset used here was collected from the Sloan Digital Sky Survey (SDSS), a wide-ranging, in-depth sky survey. Our work shows that Random Forest outperforms the other algorithms in terms of predictive performance in both environments. Although we experimented on a subset of the data, scalability issues can also be addressed using a big data framework.

Keywords: Astronomical big data, Distributed environment, Spark, Machine learning

Full Text