Data acquisition with active and impact-sensitive instance selection

X Zhu, X Wu

doi:10.1109/ictai.2004.46

Abstract

Real-world data is never perfect and often suffers from corruption or missing values that can degrade models built from it. To build accurate predictive models, data acquisition is usually adopted to complete the missing values of incomplete instances. Because acquisition is costly and the attributes in a dataset are inherently correlated, acquiring complete information for all instances is likely to be both prohibitive and unnecessary. An interesting and important problem that arises here is which instances to complete so that the model built from the data improves significantly. We propose two solutions to this problem; the essential idea is to complete the attributes with higher impact on system performance. The first solution is based on an impact-sensitive instance ranking mechanism [X. Zhu et al. (2004)]. We explore the correlation between each attribute and the class and use this correlation as the attribute's weight; the larger the weight, the higher the impact of the attribute. For each incomplete instance, we sum the weights of its attributes with missing values, and instances with larger sums are considered more important for users to complete. In the second solution, active learning, impact-sensitive instance ranking, and missing value prediction are combined for data acquisition. Experimental results on real-world datasets demonstrate the effectiveness of our strategies.
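The ranking step of the first solution can be illustrated with a minimal sketch. The abstract does not say which correlation measure is used, so the sketch assumes information gain between each attribute and the class as a stand-in weight. The names `attribute_weight` and `rank_incomplete`, the use of `None` as the missing-value marker, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2

MISSING = None  # assumed marker for a missing attribute value


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def attribute_weight(rows, labels, j):
    """Information gain of attribute j with respect to the class,
    computed only on rows where attribute j is observed.
    (Assumed stand-in for the attribute-class correlation.)"""
    observed = [(r[j], y) for r, y in zip(rows, labels) if r[j] is not MISSING]
    if not observed:
        return 0.0
    groups = {}
    for value, y in observed:
        groups.setdefault(value, []).append(y)
    base = entropy([y for _, y in observed])
    conditional = sum(len(g) / len(observed) * entropy(g) for g in groups.values())
    return base - conditional


def rank_incomplete(rows, labels):
    """Return (weights, ranking), where ranking lists (row index, score)
    for every incomplete row, highest score first. A row's score is the
    sum of the weights of its missing attributes."""
    n_attrs = len(rows[0])
    weights = [attribute_weight(rows, labels, j) for j in range(n_attrs)]
    ranking = []
    for i, row in enumerate(rows):
        missing = [j for j in range(n_attrs) if row[j] is MISSING]
        if missing:
            ranking.append((i, sum(weights[j] for j in missing)))
    ranking.sort(key=lambda item: item[1], reverse=True)
    return weights, ranking


# Toy data: attribute 0 determines the class, attribute 1 is mostly noise.
rows = [
    ("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"),
    (None, "x"), ("a", None), (None, None),
]
labels = ["pos", "pos", "neg", "neg", "pos", "pos", "neg"]

weights, ranking = rank_incomplete(rows, labels)
for index, score in ranking:
    print(f"row {index}: score {score:.3f}")
```

On this toy data the row missing both attributes is ranked first, followed by the row missing the informative attribute, then the row missing only the noisy one. Any other attribute-class correlation measure could replace `attribute_weight` without changing the ranking logic.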

Full Text