Abstract

The explosive growth of data arrays, both in the number of records and in the number of attributes, has triggered the development of numerous platforms for handling big data (Amazon Web Services, Google, IBM, Infoworks, Oracle, etc.), as well as parallel algorithms for data analysis (classification, clustering, association rules). This, in turn, has prompted the use of dimensionality reduction techniques. Feature selection, as a data preprocessing strategy, has proven effective and efficient in preparing data (especially high-dimensional data) for various data mining and machine learning tasks. Dimensionality reduction is useful not only for speeding up algorithm execution but also for improving the final classification/clustering accuracy. Noisy or even erroneous input data often yields less than desirable algorithm performance, and removing uninformative or weakly informative columns can help the algorithm find more general regions and classification rules and achieve better performance overall. This article reviews commonly used data dimensionality reduction methods and their classification. Data transformation consists of two steps: feature generation and feature selection. A distinction is made between scalar feature selection and vector methods (wrapper methods, filter methods, embedded methods, and hybrid methods); the advantages and disadvantages of each are outlined. The article then describes the application of one of the most effective dimensionality reduction methods, correspondence analysis, to the CSE-CIC-IDS2018 dataset, and evaluates the effectiveness of this method for reducing the dimensionality of that dataset in computer attack detection tasks.
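As a rough illustration of the technique the abstract names, the following is a minimal sketch of classical correspondence analysis via the singular value decomposition, written in NumPy. It assumes a nonnegative record-by-feature matrix; the function name, the toy matrix, and the choice of two output components are illustrative, not taken from the article.

```python
import numpy as np

def correspondence_analysis(X, n_components=2):
    """Project a nonnegative data matrix onto its leading
    correspondence-analysis axes (principal row coordinates)."""
    X = np.asarray(X, dtype=float)
    P = X / X.sum()                  # correspondence matrix
    r = P.sum(axis=1)                # row masses
    c = P.sum(axis=0)                # column masses
    # Standardized residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal row coordinates: one low-dimensional point per record
    rows = (U * sigma) / np.sqrt(r)[:, None]
    return rows[:, :n_components], sigma[:n_components]

# Toy contingency-style matrix (4 records x 3 features, hypothetical)
X = np.array([[20,  5,  2],
              [18,  6,  3],
              [ 2, 15,  9],
              [ 1, 14, 10]])
coords, sv = correspondence_analysis(X, n_components=2)
print(coords.shape)  # (4, 2)
```

The squared singular values sum to the total inertia of the table, so truncating to the leading components keeps the axes that explain most of the association between records and features, which is the sense in which correspondence analysis reduces dimensionality.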

