LCSS-Based Algorithm for Computing Multivariate Data Set Similarity: A Case Study of Real-Time WSN Data.

Rahim Khan, Ismail Ahmedy, Ihsan Ali, Muhammad Zakarya, Atiq Ur Rahman, Anwar Khan, Abdullah Gani, Saleh M. Altowaijri

doi:10.3390/s19010166

Abstract

Multivariate data sets are common in various application areas, such as wireless sensor networks (WSNs) and DNA analysis. A robust mechanism is required to compute their similarity indexes regardless of the environment and problem domain. This study describes the usefulness of a non-metric-based approach (i.e., longest common subsequence) in computing similarity indexes. Several non-metric-based algorithms are available in the literature; the most robust and reliable of these is the dynamic programming-based technique. However, dynamic programming-based techniques are considered inefficient, particularly in the context of multivariate data sets. Furthermore, the classical approaches are not powerful enough in scenarios involving multivariate data sets or sensor data, or when the similarity indexes are extremely high or low. To address these issues, we propose an efficient algorithm to measure the similarity indexes of multivariate data sets using a non-metric-based methodology. The proposed algorithm performs exceptionally well on numerous multivariate data sets compared with the classical dynamic programming-based algorithms. The performance of the algorithms is evaluated on several benchmark data sets and a dynamic multivariate data set obtained from a WSN deployed at the Ghulam Ishaq Khan (GIK) Institute of Engineering Sciences and Technology. Our evaluation suggests that the proposed algorithm can be approximately 39.9% more efficient than its counterparts for various data sets in terms of computational time.
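For orientation, the classical dynamic programming baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of the textbook LCSS recurrence, not the algorithm proposed in the paper. The names `lcss_length` and `similarity_index`, the per-dimension tolerance `eps`, and the normalization by the shorter series length are all assumptions made for this example.

```python
def lcss_length(a, b, eps):
    """Classical dynamic-programming LCSS for two multivariate series.

    a and b are sequences of equal-width tuples (one tuple per sample).
    Two samples are treated as matching when every dimension differs
    by at most eps (an assumed matching rule).
    Time and memory are both O(len(a) * len(b)).
    """
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if all(abs(x - y) <= eps for x, y in zip(a[i - 1], b[j - 1])):
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]


def similarity_index(a, b, eps):
    """LCSS length divided by the shorter series length (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcss_length(a, b, eps) / min(len(a), len(b))


# Hypothetical (temperature, humidity) readings from two sensor nodes.
node_a = [(20.0, 55.0), (20.5, 54.0), (25.0, 60.0), (21.0, 53.0)]
node_b = [(20.1, 55.2), (30.0, 40.0), (20.4, 54.1), (21.2, 53.3)]
print(similarity_index(node_a, node_b, eps=0.5))
```

The full table is what makes this baseline costly on long series: both the running time and the memory grow with the product of the two series lengths.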

Highlights

Multivariate data set similarity is an emerging area of research because such data sets are generated routinely in scientific experiments, industries, educational organizations, on the web and in databases [1].
These algorithms were tested on benchmark data sets and a real-time data set obtained from our deployed wireless sensor networks (WSNs) in Orange
Longest common subsequence (LCSS) is one of the most widely used mechanisms for determining the similarity indexes of different data sets, particularly univariate data sets, because non-metric-based approaches are insensitive to outliers, whereas other mechanisms are sensitive to them.
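The outlier claim in the last highlight can be illustrated with a small univariate example. The sketch below compares Euclidean distance with an LCSS-based score on a series containing one spurious reading. The data values, the tolerance `eps`, and the helper names are hypothetical and chosen only for illustration.

```python
import math


def lcss_similarity(a, b, eps):
    """LCSS length over the shorter length, using two rolling rows of the DP table."""
    n, m = len(a), len(b)
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps:
                cur[j] = prev[j - 1] + 1
            else:
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return prev[m] / min(n, m)


def euclidean(a, b):
    """Point-wise Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


clean = [20.0, 20.5, 21.0, 21.5, 22.0]
spiked = [20.0, 20.5, 95.0, 21.5, 22.0]  # one faulty reading at index 2

print(euclidean(clean, spiked))                  # dominated by the single spike
print(lcss_similarity(clean, spiked, eps=0.25))  # the spike is simply skipped
```

A single faulty sample contributes its full magnitude to the Euclidean distance, while under LCSS it only removes one element from the common subsequence.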

Summary

Introduction

Multivariate data set similarity is an emerging area of research because such data sets are generated routinely in scientific experiments, industries, educational organizations, on the web and in databases [1]. Multivariate time series are generated routinely in engineering, scientific, medical, academic, stock market, multimedia and industrial domains [10]. These data sets are difficult to investigate, and their similarity indexes are difficult to compute using existing techniques because of their multivariate nature. Benson et al. [14] and Deorowicz et al. [15] presented LCSS approaches based on the dynamic programming concept for k-length sub-string problems. The performance of these algorithms is exceptional on small data sets but degrades drastically with data set size and the value of k. The developed algorithm helped in extracting the most relevant features that can assist in accurately detecting network attacks. Another attempt to analyze WSN data sets was presented by Yohei et al. [17], who studied LCSS in at least k-length order-isomorphic sub-strings.
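To make the k-length sub-string variant mentioned above concrete, the following sketch uses the commonly stated dynamic programming recurrence for the longest common subsequence in exactly k-length sub-strings. It is an illustration under stated assumptions, not a reproduction of the algorithms in [14], [15] or [17]: it uses exact symbol equality, covers the exactly-k case rather than the at-least-k or order-isomorphic variants, and the function name `lcsk_length` is hypothetical.

```python
def lcsk_length(a, b, k):
    """Maximum number of k-length sub-strings that match, in order and
    without overlap, between sequences a and b.

    run[i][j] holds the length of the common suffix of a[:i] and b[:j],
    so run[i][j] >= k means the k-length sub-strings ending at positions
    i and j are identical.
    """
    n, m = len(a), len(b)
    run = [[0] * (m + 1) for _ in range(n + 1)]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
            # Skip the last element of a or of b.
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
            # Or end a matched k-length block at (i, j).
            if run[i][j] >= k:
                dp[i][j] = max(dp[i][j], dp[i - k][j - k] + 1)
    return dp[n][m]


print(lcsk_length("ABCDEF", "ABXDEF", 2))
print(lcsk_length("ABCBDAB", "BDCABA", 1))  # k = 1 is ordinary LCS
```

Both tables have (n + 1) x (m + 1) entries, so time and memory again grow with the product of the input lengths.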

Objectives

Results

Conclusion