Abstract

The quality of training data plays a decisive role in building intelligent models. Since raw data obtained from the real world are usually contaminated with noise for a variety of reasons, noise filtering has become an important aspect of machine learning. In contrast with the extensive research on noise elimination for classification, papers addressing this problem for regression tasks are rather scarce. In this paper, we propose a novel noise filter for cleaning instances with real-valued label noise. To address the deficiency of the existing noise determination criterion, we first propose a new adaptive threshold-based criterion. It allows a noisy instance to be defined adaptively according to the fitting difficulty of different datasets and the density of different regions. With this criterion embedded, an effective noise filtering procedure is also designed. An ensemble filtering scheme and an iterative filtering process are combined to detect as many potential noisy samples as possible in the original training set. Based on the acquired noise detection information, a noise score is developed to evaluate the noise level of each sample. The potential noisy samples whose scores exceed a reasonable threshold are then filtered further, which compensates for possible errors incurred during the previous procedure and contributes to more reliable filtering results. The validity of the proposed method is studied in extensive experiments. We discuss reasonable hyperparameter settings and compare the developed method with several state-of-the-art noise filters. The results show that the prediction accuracy of the regressor used benefits greatly from preprocessing the raw dataset with our method. At the same time, the method strikes a good balance between eliminating noisy samples and retaining clean samples, and consistently achieves better noise filtering performance.
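The pipeline described above (adaptive threshold, ensemble voting, iterative candidate detection, score-based final filtering) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the threshold form (a scale factor times the sum of a dataset-level median residual and a local label spread), the use of leave-one-out k-NN regressors as ensemble members, the vote-fraction noise score, and all names (`member_flags`, `filter_noise`, `scale`, `score_cut`) are assumptions chosen only to make the idea concrete.

```python
import statistics

def neighbours(data, i, k):
    """Indices of the k points nearest to data[i] in feature space (1-D), excluding i."""
    x_i = data[i][0]
    others = [j for j in range(len(data)) if j != i]
    return sorted(others, key=lambda j: abs(data[j][0] - x_i))[:k]

def member_flags(data, k, scale):
    """One ensemble member (assumed: a leave-one-out k-NN regressor).

    A point is flagged when its residual exceeds an adaptive threshold
    that grows with a dataset-level term (median residual, a proxy for
    fitting difficulty) and a local term (label spread of its neighbours).
    """
    residuals, local_spread = [], []
    for i in range(len(data)):
        labels = [data[j][1] for j in neighbours(data, i, k)]
        residuals.append(abs(data[i][1] - statistics.mean(labels)))
        local_spread.append(statistics.pstdev(labels))
    global_term = statistics.median(residuals)
    return [r > scale * (global_term + s)
            for r, s in zip(residuals, local_spread)]

def filter_noise(data, ks=(3, 5, 7), scale=3.0, score_cut=0.5, max_iter=5):
    """Return (clean_data, removed_indices, scores) for a list of (x, y) pairs."""
    active = list(range(len(data)))   # indices still in the working set
    scores = {}                       # candidate index -> noise score in (0, 1]
    for _ in range(max_iter):
        subset = [data[i] for i in active]
        votes = [member_flags(subset, k, scale) for k in ks]
        new = {}
        for pos, idx in enumerate(active):
            score = sum(v[pos] for v in votes) / len(ks)  # fraction of members voting "noisy"
            if score > 0:
                new[idx] = score
        if not new:                   # no further candidates: stop iterating
            break
        scores.update(new)
        # Set candidates aside so later rounds fit on cleaner data.
        active = [i for i in active if i not in new]
    # Final step: only candidates with a high enough score are removed;
    # low-score candidates are returned to the clean set.
    removed = sorted(i for i, s in scores.items() if s >= score_cut)
    removed_set = set(removed)
    clean = [p for i, p in enumerate(data) if i not in removed_set]
    return clean, removed, scores

# Toy data: y = 2x, with the label at x = 10 corrupted by +50.
data = [(x, 2 * x) for x in range(20)]
data[10] = (10, 2 * 10 + 50)
clean, removed, scores = filter_noise(data)
print(removed, len(clean))
```

In this sketch the first stage is deliberately permissive (a single vote makes a sample a candidate), while the final score cut decides what is actually discarded, mirroring the two-stage idea of detecting many candidates and then filtering them more carefully.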
