Abstract

The exponential growth of the amount of data circulating on the web has led to the emergence of the big data phenomenon. This growth is a natural consequence of the proliferation of social media, mobile devices, the abundance of free online storage, and new technologies such as the Internet of Things. Big data has in turn created several challenges for the computer science community, among which the sheer size of the data is the most pressing. Traditional machine learning algorithms, used mostly for insight extraction, prove inadequate at this scale, even on high-performance computer architectures. Big data analytics algorithms can overcome the size issue in one of two ways: (1) adapting existing machine learning techniques to the scale of big data; or (2) sampling big datasets, i.e., randomly choosing much smaller subsets of the data population that current algorithms can handle. In the present work, we pursue the second alternative to address the size challenge in the big data context. We propose intelligent sampling techniques based on Scalable Simple Random Sampling (ScaSRS) and the Subsampled Double Bootstrap (SDB). Tests carried out on public generic datasets show that our proposal addresses the size dimension efficiently. The proposed algorithms were evaluated on a classification task, where the obtained results show a significant improvement over the state of the art.
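To make the subsampling alternative concrete, the sketch below draws a simple random subset of a large dataset and trains a classifier on it. This is only an illustration of alternative (2) above, not the ScaSRS or SDB procedures proposed in the paper; the synthetic dataset, sampling fraction, and model choice are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def simple_random_subsample(X, y, fraction=0.01, seed=0):
    """Draw a uniform random subset of (X, y) without replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    k = max(1, int(fraction * n))
    idx = rng.choice(n, size=k, replace=False)
    return X[idx], y[idx]

# Illustrative usage on synthetic data (a stand-in for a big dataset).
X = np.random.rand(1_000_000, 20)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Train on the much smaller random subset instead of the full data.
X_small, y_small = simple_random_subsample(X, y, fraction=0.01)
clf = LogisticRegression(max_iter=1000).fit(X_small, y_small)
```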
