Intelligent methods for improving the accuracy of prediction of rare hazardous events in railway transportation

O B Pronevich, M V Zaitsev

doi:10.21683/1729-2646-2021-21-3-54-65

Abstract

Aim. The paper aims to examine approaches to improving the quality of prediction and classification on unbalanced data, with the goal of improving the accuracy of rare event classification. When predicting the onset of rare events using machine learning techniques, researchers face an inconsistency between the measured quality of trained models and their actual ability to correctly predict the occurrence of a rare event. The paper examines model training on unbalanced initial data. The subject of research is the information on incidents and hazardous events at railway power supply facilities. The problem of unbalanced data manifests itself as a noticeable imbalance between the types of observed events, i.e., between the numbers of instances of each type.

Methods. When handling unbalanced data, various data science techniques are used to improve the quality of classification and prediction models, depending on the nature of the problem at hand and the quality and size of the initial data. Some of those methods focus on the attributes and parameters of classification models. Those include FAST, CFS, fuzzy classifiers, GridSearchCV, etc. Another group of methods is oriented towards generating representative subsets, i.e., samples, out of the initial datasets. Data sampling techniques allow examining the effect of class proportions on the quality of machine learning. In particular, this paper considers the NearMiss method in detail.

Results. The problem of class imbalance in the analysis of the number of incidents at railway facilities has existed since 2015. Despite the decreasing share of hazardous events at railway power supply facilities in the three years since 2018, an increase in the number of such events cannot be ruled out. Monthly statistics of hazardous event distribution show no consistent pattern of declines and peaks. In this context, the optimal period of observation of the number of incidents and hazardous events is a month.
A visualization of the class ratio has shown the absence of a clear boundary between the members of the majority class (incidents) and those of the minority class (hazardous events). The class ratio was studied in two and three dimensions, both in actual values and using principal component analysis. Such "proximity" of classes is one of the causes of wrong predictions. The authors analysed past research on ways of improving the quality of machine learning on unbalanced data. The terms that describe the degree of class imbalance have been defined and clarified. The strengths and weaknesses of 50 different methods of handling such data were studied and set forth. Among the methods that work with the numbers of class members, the NearMiss method was chosen for the classification (prediction of the occurrence) of rare hazardous events in railway transportation. It allows experimenting with the class ratios and with the methods of selecting class members. As a result of a series of experiments, the accuracy of rare hazardous event classification was improved from 0 to 70-90%.
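
The mismatch between overall model quality and the ability to detect rare events can be illustrated with a minimal sketch. The data below are hypothetical (95 incidents, 5 hazardous events) and are not taken from the paper; the sketch also assumes that the "accuracy of rare event classification" is read as the share of rare events correctly identified (recall for the minority class).

```python
def accuracy(y_true, y_pred):
    """Share of all observations that were classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of observations of the `positive` class that were found."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    hits = sum(p == positive for p in preds_for_positives)
    return hits / len(preds_for_positives)


# Hypothetical unbalanced sample: 0 = incident (majority), 1 = hazardous event (minority).
y_true = [0] * 95 + [1] * 5

# A degenerate model that always predicts the majority class.
y_pred_majority = [0] * len(y_true)

overall = accuracy(y_true, y_pred_majority)        # high, despite a useless model
rare_event_recall = recall(y_true, y_pred_majority)  # no rare event is detected

print(f"accuracy = {overall:.2f}, rare-event recall = {rare_event_recall:.2f}")
```

Here the overall accuracy is 0.95 while no hazardous event is detected, which is why class-specific measures are more informative than overall accuracy on unbalanced data.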

Highlights

The problem of class imbalance in the analysis of the number of incidents at railway transport facilities has existed since 2015.
[38] Neural networks with SMOTE. [39] kNN classifier for medical [...]. Hybridization with other learning algorithms to achieve better results is gaining popularity in the classification of class-imbalanced data. Hybridization is applied to ease the problems of sampling, subset selection [...] through combination with other learning algorithms.
Random exclusion of a number of majority-class members during sampling.

Summary

[Figure: axis label "Prediction accuracy"]

Methods of improving classification quality can be divided into two groups: those that work with features and parameters, and those that work with the number of class members. Although these methods are now widely known, there is no single algorithm for their combined use to improve classification quality. For example, [1] proposes a scheme that combines classification algorithms with the feature selection methods RFE, Random Forest and Boruta, preceded by class balancing with the SMOTE and ADASYN random sampling methods. Paper [2] demonstrates that adjusting class proportions on highly unbalanced data improves classification accuracy for some models. [...] 2, as well as the experience of researchers who study data sampling methods [51, 52, 53, 54], the NearMiss method [47, 55] was chosen as the main method of improving the quality of classification of unbalanced data. The purpose of the NearMiss method is to balance the distribution of observations between the minority and majority classes based on an estimate of the distance between instances of different classes.
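
The distance-based selection behind NearMiss can be sketched as follows. This is a minimal illustration of the NearMiss-1 variant (keep the majority-class points whose mean distance to their k nearest minority-class points is smallest), not the authors' implementation. The function name `nearmiss1`, its parameters and the sample points are hypothetical, and the source does not state which NearMiss variant or which settings were used.

```python
import math


def nearmiss1(majority, minority, n_keep, k=3):
    """Undersample `majority` to `n_keep` points (NearMiss-1 style).

    Each majority point is scored by the mean Euclidean distance to its
    k nearest minority points; the points with the smallest scores are kept.
    """
    k = min(k, len(minority))
    scored = []
    for idx, point in enumerate(majority):
        dists = sorted(math.dist(point, m) for m in minority)
        scored.append((sum(dists[:k]) / k, idx))
    scored.sort()
    # Restore the original order of the selected majority points.
    keep = sorted(idx for _, idx in scored[:n_keep])
    return [majority[i] for i in keep]


# Hypothetical 2-D feature vectors, for illustration only.
minority = [(0.0, 0.0), (1.0, 0.0)]                 # rare hazardous events
majority = [(0.5, 0.5), (5.0, 5.0), (0.0, 1.0),
            (10.0, 0.0), (3.0, 3.0), (-6.0, 0.0)]   # incidents

# Keep as many majority points as there are minority points (1:1 ratio).
balanced_majority = nearmiss1(majority, minority, n_keep=len(minority), k=2)
print(balanced_majority)  # [(0.5, 0.5), (0.0, 1.0)]
```

Changing `n_keep` varies the class ratio, which is the kind of experiment with proportions and selection rules that the method makes possible.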

Data-level approach

Risk of overfitting

Cost-sensitive learning

Ensemble method

Hybrid approach

Other methods

With insufficient

Reduces the error of misclassifying a member of the majority class

Improves the accuracy of classification of minority class members

Analysis and conclusions

References

Author contributions

Journal: Dependability	Publication Date: Sep 21, 2021
Citations: 2	License type: cc-by
