Abstract

Applying traditional statistical methods and machine learning to massive datasets is challenging owing to the limits of computer primary memory. The composite quantile regression neural network (CQRNN) is an efficient and robust estimation method, but most existing computational algorithms cannot solve CQRNN for massive datasets reliably and efficiently. To this end, we propose a divide-and-conquer CQRNN (DC-CQRNN) method that extends CQRNN to massive datasets. The main idea is to divide the overall dataset into subsets, apply CQRNN to the data within each subset, and obtain the final result by combining the subset training results via a weighted average. This approach significantly reduces both the amount of primary memory required and the computational time. Monte Carlo simulation studies and an application to an environmental dataset with millions of observations show that the proposed approach performs well for CQRNN on massive datasets. The proposed DC-CQRNN method has been implemented in Python on a Spark system; it completes model training in 8 minutes, whereas CQRNN on the full dataset takes 5.27 hours.

Highlights

  • With the development of information technology, the mobile Internet, social networks, and e-commerce have greatly expanded the boundaries and applications of the Internet

  • We propose a divide-and-conquer composite quantile regression neural network (DC-CQRNN) method to extend CQRNN to massive datasets. The main idea is to divide the overall dataset into subsets, apply CQRNN to the data within each subset, and obtain the final result by combining the subset training results via a weighted average. The proposed DC-CQRNN method can significantly reduce the computational time and the required amount of primary memory, and the training results will be as effective as those from analyzing the full data at once

  • When using the CQRNN method to deal with environmental datasets, we found that the data can be so large that the computer's primary memory overflows and the computation takes too long to produce results quickly


Summary

Introduction

With the development of information technology, the mobile Internet, social networks, and e-commerce have greatly expanded the boundaries and applications of the Internet. According to Intel’s forecast, in 2020, a networked self-driving car will generate 4 TB of data every 8 hours of operation. Massive datasets offer researchers both unprecedented challenges and opportunities. The key challenge is that directly applying machine learning and statistical methods to these massive datasets with conventional computing methods is prohibitive: the data can be so large that the computer's primary memory overflows. To overcome these challenges, researchers have proposed divide-and-conquer methods [1,2,3], which may be an effective way to analyze massive datasets.
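The divide-and-conquer idea can be illustrated with a minimal Python sketch. It is not the DC-CQRNN implementation: the per-subset estimator here is ordinary least squares, used only as a stand-in for CQRNN training, and the weights are assumed to be proportional to subset size, which this excerpt does not specify. The function names `split_indices`, `fit_subset`, and `dc_estimate` are hypothetical.

```python
import numpy as np


def split_indices(n, n_subsets, seed=0):
    """Randomly partition the row indices 0..n-1 into n_subsets blocks."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), n_subsets)


def fit_subset(X, y):
    """Stand-in estimator (assumption): least squares instead of CQRNN."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def dc_estimate(X, y, n_subsets, seed=0):
    """Fit each subset separately, then combine by a weighted average.

    Weights proportional to subset size are an assumption of this sketch.
    """
    parts = split_indices(len(y), n_subsets, seed)
    # Only one subset needs to be held in memory at a time.
    estimates = np.stack([fit_subset(X[idx], y[idx]) for idx in parts])
    weights = np.array([len(idx) for idx in parts]) / len(y)
    return weights @ estimates
```

Because each subset is fitted independently, the per-subset fits could in principle be run in parallel, for example on separate Spark partitions, with only the small parameter vectors collected for the final average.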
