Abstract

Background/Objectives: This paper addresses methods for handling incomplete training data in which some values have been lost. Such loss is unavoidable in ubiquitous environments that collect data from multiple devices or over long distances, and simply discarding the lost portions makes good results difficult to obtain because information is thrown away.

Methods/Statistical analysis: Various algorithms have been proposed to solve this problem. Among them, algorithms that learn by converting the training data into a format suited to incomplete data have been applied to a range of problems with good results. This format conversion is called the data expansion technique, and it has two useful properties: the importance of each event can be weighted individually, and a probability value can be assigned to each possible value (cardinality) of each variable.

Findings: The second property, the ability to assign probability values, was used to assign compensation values for lost data. The first attempt assigned equal probabilities to all possible values of a lost attribute; in classification algorithms that use the entropy function, a variable containing more lost values is then less likely to be selected at an upper node. However, assigning equal values ignores the information that remains in the original data. A later method therefore computed probabilities via entropy from the complete information, excluding the lost values, and used them to fill in the losses. This paper develops that idea further, starting from the basic observation that the lost information can be recovered from the regions produced by a classification algorithm. In the implementation, the training data is divided into two parts, events with and without lost values; the complete events are classified with the C4.5 algorithm, yielding a set of classification regions.
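The equal-probability expansion described above can be sketched in a few lines; the event representation and function name here are hypothetical, and the entropy is the weighted form an entropy-based split criterion such as C4.5's would compute over expanded events:

```python
from collections import defaultdict
from math import log2

def weighted_entropy(events):
    """Entropy of the class label where each event carries a weight.

    Under the data expansion technique, a lost value is replaced by
    weighted copies of the event (one per possible value), so class
    counts become fractional.  `events` is a list of (label, weight).
    """
    counts = defaultdict(float)
    for label, weight in events:
        counts[label] += weight
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total)
                for c in counts.values() if c > 0)

# Two complete events with distinct labels give maximal entropy (1 bit):
print(weighted_entropy([("yes", 1.0), ("no", 1.0)]))

# A lost event under equal-probability expansion: one copy per possible
# value of the lost attribute, each weighted 1/2 (hypothetical example).
expanded_lost = [("yes", 0.5), ("no", 0.5)]
```

Because expanded lost events spread their weight evenly across values, a variable with many losses yields higher-entropy splits and tends not to be chosen near the root, as the abstract notes.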
Then, using the information remaining in each lost event, the classification regions are traversed according to a predetermined algorithm to find the region closest to that event. The values of this region are expressed as probabilities, and the lost value is replaced with these probability values; this replacement constitutes the compensation for the loss. Improvements/Applications: After all lost values have been compensated, the recovered training data is trained with an SVM algorithm to evaluate how well the information has been preserved, and the resulting performance is compared. Experiments with different degrees of loss for each variable confirmed that the loss of information can be minimized and that the approach serves as a meaningful compensation method.
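The compensation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: each distinct complete event stands in for a C4.5 classification region, "closest" means agreeing on the most non-lost attributes, and all names are hypothetical:

```python
from collections import Counter

def compensate(complete, incomplete, attrs):
    """Fill lost values (None) with probability distributions.

    `complete` / `incomplete` are lists of dicts mapping attribute -> value.
    Each complete event is treated as one classification region here,
    standing in for a C4.5 leaf (an assumption for this sketch).
    """
    regions = [dict(e) for e in complete]
    repaired = []
    for ev in incomplete:
        known = {a: v for a, v in ev.items() if v is not None}
        # Closest region: the one agreeing with the most known attributes.
        best = max(regions,
                   key=lambda r: sum(r[a] == v for a, v in known.items()))
        filled = dict(ev)
        for a in attrs:
            if filled[a] is None:
                # Distribution of this attribute over regions matching
                # every known value; fall back to the closest region.
                matches = [r[a] for r in regions
                           if all(r[k] == v for k, v in known.items())]
                pool = matches or [best[a]]
                c = Counter(pool)
                total = sum(c.values())
                # The lost value becomes a probability assignment,
                # which the data expansion format can represent directly.
                filled[a] = {v: n / total for v, n in c.items()}
        repaired.append(filled)
    return repaired

complete = [{"x": "a", "y": "p"}, {"x": "a", "y": "q"}, {"x": "b", "y": "p"}]
incomplete = [{"x": "a", "y": None}]
print(compensate(complete, incomplete, ["x", "y"]))
# The lost y is replaced by {"p": 0.5, "q": 0.5}
```

The recovered events, with probabilities in place of lost values, can then be fed to a downstream learner such as an SVM for the evaluation the paper describes.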
