Abstract

Time series data are becoming increasingly important as the world grows more interconnected. Classical problems keep growing in size and demand ever more processing resources, and Big Data technologies offer many solutions for them. Although the principal algorithms for traditional vector-based problems are available in Big Data environments, these environments still lack tools for time series processing. In this work, we propose a scalable and distributed time series transformation for Big Data environments based on well-known time series features (SCMFTS), which allows practitioners to apply traditional vector-based algorithms to time series problems. The proposed transformation, combined with the algorithms available in Spark, improved on the best state-of-the-art results on the Wearable Stress and Affect Detection (WESAD) dataset, which is the largest publicly available multivariate time series dataset in the University of California Irvine (UCI) Machine Learning Repository. In addition, the runtime of SCMFTS grew linearly with the number of processed time series, demonstrating the linear scalability that Big Data environments require. SCMFTS has been implemented in the Scala programming language for the Apache Spark framework, and the code is publicly available.

Highlights

  • Nowadays, we can find devices generating data anywhere and at any time [2]

  • We propose SCMFTS, a scalable and distributed time series transformation based on well-known time series features, which provides an alternative vector-based representation of time series and so enables the use of the traditional machine learning techniques available in Big Data environments

  • A high number of data points generally produces high runtimes, but this does not hold when the runtimes for variables c_ACCx, c_ACCy, or c_ACCz are compared with that of w_BVP. The reason is the difference in the frequency value of these variables, which enters the calculation of the time series features and therefore affects the runtime. This behavior is not caused by the Spark implementation; it depends on the structure of the input time series


Summary

Introduction

We can find devices generating data anywhere and at any time [2]. With the expansion of new technologies, the volume of generated data is growing rapidly. We propose SCMFTS, a scalable and distributed time series transformation based on well-known time series features, which provides an alternative vector-based representation of time series and so enables the use of the traditional machine learning techniques available in Big Data environments. We have implemented it in Scala for Apache Spark, guaranteeing fully scalable behavior; it is the first proposal of this type for Big Data environments. SCMFTS allows practitioners to tackle problems that would otherwise be impossible to address and to improve results through the additional information provided by the new time series features.
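To make the idea of a feature-based, vector representation concrete, the following is a minimal sketch in plain Scala. It is not the SCMFTS implementation: the object name `FeatureTransformSketch`, the function names, and the choice of five features (mean, variance, lag-1 autocorrelation, minimum, maximum) are assumptions made purely for illustration. The point it shows is that series of any length map to vectors of the same fixed length, which is what lets vector-based algorithms consume them.

```scala
// Illustrative sketch only; names and the feature set are assumptions,
// not the SCMFTS code or its actual list of features.
object FeatureTransformSketch {

  /** Arithmetic mean of a non-empty series. */
  def mean(xs: Seq[Double]): Double = xs.sum / xs.length

  /** Population variance of a non-empty series. */
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.length
  }

  /** Lag-k autocorrelation; 0.0 for a constant series or when lag >= length. */
  def autocorrelation(xs: Seq[Double], lag: Int): Double = {
    val m = mean(xs)
    val denom = xs.map(x => (x - m) * (x - m)).sum
    if (denom == 0.0 || lag >= xs.length) 0.0
    else {
      // Pair each x(t + lag) with x(t) and sum the centred products.
      val num = xs.drop(lag).zip(xs).map { case (a, b) => (a - m) * (b - m) }.sum
      num / denom
    }
  }

  /** Maps a non-empty, variable-length series to a fixed-length feature vector. */
  def toFeatureVector(xs: Seq[Double]): Array[Double] =
    Array(mean(xs), variance(xs), autocorrelation(xs, 1), xs.min, xs.max)
}
```

Because `toFeatureVector` depends only on its own input series, each series can be transformed independently of the others. On Spark, one plausible way to distribute such a function would be to apply it with `map` over an RDD of series, although how SCMFTS actually organizes the computation is not described in this summary.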

Time Series in Big Data
Big Data Frameworks
Scalable and Distributed Time Series Transformation Proposal
Transformed Data
Experimental Design
Datasets
Measures and Methodology
Models
Hardware and Software
Results
Performance Results on WESAD
Conclusions
