A statistical framework for labeling unlabelled data: a case study on anomaly detection in pressurization systems for high-speed railway trains

Enrico De Santis, Francesco Arno, Antonello Rizzi, Alessio Martino

doi:10.1109/ijcnn55064.2022.9892880

Abstract

The ability to perform predictive maintenance, one of the main assets of Industry 4.0, is known to help reduce downtime and costs and to improve control and production quality. Modern predictive maintenance programs involve machine learning techniques, within the AI umbrella, that work in a data-driven fashion. This holds for any machinery where intelligent sensors make it possible to collect data that can be processed to detect faults or carry out anomaly detection. This paper presents a system for the detection of anomalies in the railway context and, specifically, in the pressurization systems of Italian high-speed trains. The available real-world dataset consists of unlabeled time series with a fixed length of 600 samples. Hence, a two-stage machine learning workflow is proposed, in which the first stage acts in an unsupervised fashion through a statistical technique, validated by field experts, with the aim of building a labeled dataset. In the second stage, the problem is cast as a classification task under strong class imbalance, a condition very likely in predictive maintenance, and two feature engineering techniques are compared. The first feeds the raw signals directly to an SVM algorithm. In the second, the time series undergo an adaptive heuristic piecewise approximation procedure, whose output is a sequence of $\mathbb{R}^{2}$ vectors (slopes and intercepts). In this case, the classification task is carried out in the so-called “dissimilarity space” for pattern recognition, adopting different dimensions of the representation set obtained through a clustering algorithm. The dissimilarity measure is an ad hoc edit distance capable of measuring the dissimilarity between sequences of 2-dimensional vectors. In this study, a k-medoids clustering procedure is adopted for balancing the dataset, together with additional techniques for addressing the challenging problem of unbalanced data, offering an in-depth comparison of various experimental methodologies.
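As an illustration of the second feature engineering route, the following minimal sketch shows how a time series could be turned into a sequence of (slope, intercept) pairs and then embedded in a dissimilarity space. It is not the method of the paper: the fixed segment length replaces the adaptive heuristic procedure, the edit-style distance (Euclidean substitution cost, vector norm as insertion/deletion cost) is only an assumed stand-in for the ad hoc edit distance, and the names `fit_line`, `piecewise_linear`, `edit_dissimilarity` and `embed` are hypothetical. In the paper the representation set is obtained through clustering; here it is simply passed in as a list of prototypes.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (slope, intercept) of one linear piece


def fit_line(ys: Sequence[float]) -> Segment:
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    if n == 1:
        return (0.0, float(ys[0]))
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(ys))
    slope = sxy / sxx
    return (slope, mean_y - slope * mean_x)


def piecewise_linear(series: Sequence[float], seg_len: int) -> List[Segment]:
    """Assumption: fixed-length segments instead of an adaptive segmentation."""
    return [fit_line(series[i:i + seg_len])
            for i in range(0, len(series), seg_len)]


def edit_dissimilarity(a: Sequence[Segment], b: Sequence[Segment]) -> float:
    """Edit-style distance between two sequences of 2-D vectors (assumed costs)."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + math.hypot(*a[i - 1])  # delete a[i-1]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + math.hypot(*b[j - 1])  # insert b[j-1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + math.dist(a[i - 1], b[j - 1]),  # substitute
                d[i - 1][j] + math.hypot(*a[i - 1]),              # delete
                d[i][j - 1] + math.hypot(*b[j - 1]),              # insert
            )
    return d[n][m]


def embed(x: Sequence[Segment],
          prototypes: Sequence[Sequence[Segment]]) -> List[float]:
    """Dissimilarity-space vector: one coordinate per prototype."""
    return [edit_dissimilarity(x, p) for p in prototypes]
```

Each embedded vector has as many coordinates as there are prototypes, so varying the size of the representation set directly changes the dimension of the space in which a standard classifier such as an SVM is trained.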

Full Text