Attribute reduction of data with error ranges and test costs

Fan Min,William Zhu

doi:10.1016/j.ins.2012.04.031

Abstract

In data mining applications, we have a number of measurement methods to obtain a data item with different test costs and different error ranges. Test costs refer to time, money, or other resources spent in obtaining data items related to some object; observational errors correspond to differences in measured and true value of a data item. In supervised learning, we need to decide which data items to obtain and which measurement methods to employ, so as to minimize the total test cost and help in constructing classifiers. This paper studies this problem in four steps. First, data models are built to address error ranges and test costs. Second, error-range-based covering rough set is constructed to define lower and upper approximations, positive regions, and relative reducts. A closely related theory deals with neighborhood rough set, which has been successfully applied to heterogeneous attribute reduction. The major difference between the two theories is the definition of neighborhood. Third, the minimal test cost attribute reduction problem is redefined in the new theory. Fourth, both backtrack and heuristic algorithms are proposed to deal with the new problem. The algorithms are tested on ten UCI (University of California – Irvine) datasets. Experimental results show that the backtrack algorithm is efficient on rational-sized datasets, the weighting mechanism for the heuristic information is effective, and the competition approach can improve the quality of the result significantly. This study suggests new research trends concerning attribute reduction and covering rough set.

Full Text