Abstract

Real-time big data analytics, which is the combination of real-time analytics and big data, works on processing large scale data as it arrives and strives to obtain insights from it without exceeding a limited time period. Massive amount of data is being generated every moment through web sites, social networks, scientific experiments, etc. which is stored in the cloud. When decision making requires an insight of the raw data in real-time (often by applying machine learning algorithms such as dimensionality reduction), these algorithms fail to analyze these incessantly flowing high volume data i.e dynamic big data. Our work proposes a variant of scalable principal component analysis (PCA) which is suited for real-time big data applications. We maintain a sliding window (representing incoming data in real-time) over the most recent data and project every incoming data into lower dimensional subspace. Our goal is to minimize the reconstruction error of the output from the input and keep updating the principal components depending on it. We have implemented our scalable algorithm on popular Spark framework for distributed platform and performed extensive experiments on datasets from a variety of real-time applications e.g. activity recognition, customer expenditure, etc. Furthermore, we have demonstrated that our algorithm can capture the changing distributions of real-life datasets, thus enabling real-time PCA. We have also compared the performances of distributed and non-distributed versions of our algorithm over a variety of window sizes and showed that our distributed algorithm is scalable and performs better when window size and target dimension increase.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.