Abstract

Introduction: The volume of system logs produced by computer systems integrated into a distributed network infrastructure makes it impossible to check them manually in real time. Typically, each log record contains the numeric value of an observed attribute and a flag marking the record as normal or abnormal. The Support Vector Data Description (SVDD) algorithm demonstrates high classification accuracy even with a small training sample. However, the algorithm works with a multi-attribute dataset in which each observation carries a single classifying label. Consequently, the set of per-attribute labels in the initial data has to be reduced to one label for the entire observation.

Purpose: to investigate the binary classification accuracy of the SVDD algorithm on experimental data with a small training sample, given that the data are labeled for each attribute separately.

Methods: a method is proposed for reducing the set of per-attribute labels in the initial data to a single label for the entire observation by means of two approaches: "completely normal observation" and majority voting. Two types of data are considered: ordered in time and uniformly mixed. Classification accuracy was assessed by calculating the area under the ROC curve with cross-validation for different numbers of attributes.

Results: a comparative analysis of the observation labeling methods showed the advantage of the "completely normal observation" approach over the "majority vote" approach without "weighting". Classification accuracy on mixed data is shown to be 7% higher than on data ordered in time. The accuracy of the algorithm was investigated for different numbers of attributes using the "completely normal observation" approach. The maximum classification accuracy achieved was about 96%, obtained with 6 attributes and uniform mixing of the input dataset. A further increase in the number of attributes lowers the average classification accuracy because the proportion of anomalous observations grows. It is also shown that uniform mixing of the input data can raise accuracy by 15–20%.

Practical relevance: the algorithm demonstrates exponential growth in the consumption of computing resources as the amount of input data increases.

Discussion: to achieve the maximum classification accuracy with acceptable resource consumption, it is necessary to form a compact set of input data that most fully reflects the functioning of the computer system in normal mode.
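
The two label-reduction approaches named in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that flags are encoded as 0 (normal) and 1 (abnormal), that "completely normal observation" means an observation is normal only when every attribute flag is normal, and that an unweighted majority vote treats ties as normal. The function names are hypothetical.

```python
import numpy as np


def label_completely_normal(flags):
    """Reduce per-attribute flags to one label per observation.

    Assumed rule: an observation is normal (0) only if all of its
    attribute flags are normal; any abnormal flag makes it abnormal (1).
    """
    flags = np.asarray(flags, dtype=bool)
    return flags.any(axis=1).astype(int)


def label_majority_vote(flags):
    """Reduce per-attribute flags by an unweighted majority vote.

    Assumed rule: an observation is abnormal (1) only if strictly more
    than half of its attribute flags are abnormal; ties count as normal.
    """
    flags = np.asarray(flags, dtype=bool)
    n_attributes = flags.shape[1]
    return (2 * flags.sum(axis=1) > n_attributes).astype(int)


# Four observations with three attributes each (rows = observations).
example_flags = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]

print(label_completely_normal(example_flags).tolist())  # [0, 1, 1, 1]
print(label_majority_vote(example_flags).tolist())      # [0, 0, 1, 1]
```

Under the assumed "completely normal" rule, adding an attribute can only keep or raise the share of observations labeled abnormal, which is consistent with the reported growth in the proportion of anomalous observations as the number of attributes increases.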
