Abstract

RHadoop enables R users to perform big data analytics within the R programming environment by integrating R with Hadoop, which supports distributed storage and parallel processing of large-scale data. This article proposes an RHadoop-based MapReduce programming model for estimating multiple linear regression models using QR factorization. The proposed algorithm targets the most common type of big data: a vast number of data points with only a few hundred variables. For QR factorization over massive-scale data, the algorithm employs the DirectQR method proposed by Benson, Gleich, and Demmel (2013); however, it does not require that method's iterative steps to estimate the regression coefficients. In a comparative simulation study, the algorithm is compared with a MapReduce-based algorithm proposed in previous studies. To generate realistic synthetic data, we use New York City (NYC) yellow taxi trip data reported to the NYC Taxi and Limousine Commission. In the simulation, we measure the estimation time and accuracy of each algorithm under several assumptions about the strength of association between the independent variables and the noise level of the error terms in the regression models.
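The idea of estimating regression coefficients from block-wise QR factors can be illustrated with a minimal sketch. It is not the article's RHadoop implementation and not the DirectQR method itself: it is written in plain Python rather than R, it uses a simple modified Gram-Schmidt routine, and it relies on the assumed shortcut of factoring the augmented matrix [X | y] so that Q never has to be formed. All function names (`r_factor`, `back_substitute`, `fit_by_blocks`) are hypothetical, and each row block is assumed to have at least as many rows as the augmented matrix has columns.

```python
import math


def r_factor(rows):
    """Upper-triangular R of a thin QR of `rows`, via modified Gram-Schmidt."""
    n = len(rows[0])
    # Column-major working copy of the block.
    cols = [[row[j] for row in rows] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        norm = math.sqrt(sum(v * v for v in cols[j]))
        r[j][j] = norm
        if norm == 0.0:
            continue  # nothing left of this column to orthogonalize against
        q = [v / norm for v in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(qv * cv for qv, cv in zip(q, cols[k]))
            cols[k] = [cv - r[j][k] * qv for qv, cv in zip(q, cols[k])]
    return r


def back_substitute(r, c):
    """Solve r * beta = c for an upper-triangular matrix r."""
    p = len(c)
    beta = [0.0] * p
    for i in range(p - 1, -1, -1):
        s = c[i] - sum(r[i][k] * beta[k] for k in range(i + 1, p))
        beta[i] = s / r[i][i]
    return beta


def fit_by_blocks(x_rows, y, block_size):
    """Least-squares coefficients from per-block R factors of [X | y].

    Assumes every block has at least len(x_rows[0]) + 1 rows.
    """
    aug = [list(xr) + [yi] for xr, yi in zip(x_rows, y)]
    # "Map" step: each row block is reduced to a small square R factor.
    block_rs = [r_factor(aug[i:i + block_size])
                for i in range(0, len(aug), block_size)]
    # "Reduce" step: stack the small R factors and factor them once more.
    stacked = [row for r in block_rs for row in r]
    r = r_factor(stacked)
    p = len(x_rows[0])
    r11 = [row[:p] for row in r[:p]]   # R factor of X
    qty = [row[p] for row in r[:p]]    # first p entries of Q^T y
    return back_substitute(r11, qty)


# Noise-free toy data: y = 1 + 2*x1 - 3*x2, split into two blocks of six rows.
x_rows = [[1.0, float(i % 7), float((i * i) % 5)] for i in range(12)]
y = [1.0 + 2.0 * row[1] - 3.0 * row[2] for row in x_rows]
beta = fit_by_blocks(x_rows, y, block_size=6)
print([round(b, 6) for b in beta])
```

Because only the small R factors leave each block, the data passed to the final step scales with the number of variables rather than the number of data points, which is why this style of computation suits data with many rows and comparatively few columns.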
