Abstract

RHadoop enables R users to perform big data analytics within the R programming environment by integrating R with Hadoop, which supports distributed storage and parallel processing of large-scale data. This article proposes an RHadoop-based MapReduce programming model for estimating multiple linear regression models using QR factorization. Our proposed algorithm accommodates the most common type of big data, which has a vast number of data points but only a few hundred variables. For QR factorization over massive data, our algorithm employs the DirectQR method proposed by Benson, Gleich, and Demmel (2013); however, it does not require that method's iterative steps to estimate the regression coefficients. In a comparative simulation study, we compare our algorithm with a MapReduce-based algorithm proposed in previous studies. To generate realistic synthetic data, we use New York City (NYC) yellow taxi trip data reported to the NYC Taxi and Limousine Commission. In the simulation, we measure the estimation time and accuracy of each algorithm under several assumptions about the strength of association between the independent variables and the noise level of the error terms in the regression models.
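As a rough illustration of the general idea, and not the authors' RHadoop implementation, the following single-machine Python sketch mimics a map step and a reduce step for QR-based regression on row blocks. It assumes that each block is augmented as [X_i | y_i] and that only the R factors are passed to the reducer; the article may handle Q differently. All function names (`r_factor`, `map_block`, `reduce_r`, `solve_from_r`, `fit_blocks`) are hypothetical.

```python
import math

def r_factor(rows):
    """Upper-triangular R from a Householder QR of a dense matrix (list of rows)."""
    a = [list(r) for r in rows]
    m, n = len(a), len(a[0])
    for k in range(min(m, n)):
        norm = math.sqrt(sum(a[i][k] ** 2 for i in range(k, m)))
        if norm == 0.0:
            continue  # nothing to eliminate in this column
        alpha = -norm if a[k][k] >= 0 else norm
        v = [a[i][k] for i in range(k, m)]
        v[0] -= alpha  # Householder vector v = x - alpha * e1
        vnorm2 = sum(x * x for x in v)
        for j in range(k, n):
            f = 2.0 * sum(v[i] * a[k + i][j] for i in range(m - k)) / vnorm2
            for i in range(m - k):
                a[k + i][j] -= f * v[i]
    return [[a[i][j] if j >= i else 0.0 for j in range(n)]
            for i in range(min(m, n))]

def map_block(x_rows, y_vals):
    """Map step (assumed): R factor of the augmented block [X_i | y_i]."""
    return r_factor([list(x) + [y] for x, y in zip(x_rows, y_vals)])

def reduce_r(r_list):
    """Reduce step (assumed): stack the per-block R factors and factor again."""
    return r_factor([row for r in r_list for row in r])

def solve_from_r(r):
    """Solve R11 * beta = r12 by back substitution, where R = [[R11, r12], [0, rho]]."""
    p = len(r[0]) - 1
    beta = [0.0] * p
    for i in range(p - 1, -1, -1):
        s = sum(r[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (r[i][p] - s) / r[i][i]
    return beta

def fit_blocks(x_rows, y_vals, block_size):
    """Split the rows into blocks, map each block, reduce, then solve."""
    r_list = [map_block(x_rows[s:s + block_size], y_vals[s:s + block_size])
              for s in range(0, len(x_rows), block_size)]
    return solve_from_r(reduce_r(r_list))
```

Because the stacked R factors have the same Gram matrix as the full augmented data matrix, the final R yields the least-squares coefficients without ever forming Q. The sketch assumes the design matrix has full column rank.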
