Outlier Detection by Regression Diagnostics in Large Data

A.A.M. Nurunnabi, Mohammed Nasser

doi:10.1109/icfcc.2009.60

Abstract

Regression analysis is a well-known supervised learning technique. To estimate and justify an effective regression model, it is necessary to check and preprocess the data set. Real data are rarely free of outliers (noise). In areas such as bioinformatics, astronomy, image analysis, and computer vision, large or fat data containing unusual observations (outliers) arise very naturally. In these fields, robust regression is commonly used in the model-building process, but robust regression methods are not good enough for large and/or high-dimensional data. Checking raw data for outliers in regression is called regression diagnostics. Robust regression and regression diagnostics are complementary ideas, and neither alone is enough for studying contaminated data. Most of the popular diagnostic methods are not sufficient for large data because of masking and swamping. In this article, we briefly discuss both ideas and show that a new measure can effectively identify outliers (influential observations) in linear regression for large data.
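As a point of reference for the popular diagnostic methods mentioned above, the following is a minimal sketch of the classical single-case diagnostics for ordinary least squares: leverage (hat-matrix diagonal), internally studentized residuals, and Cook's distance. It is not the new measure proposed in the article. The function name `classical_diagnostics`, the added intercept column, and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np

def classical_diagnostics(X, y):
    """Leverage, internally studentized residuals, and Cook's distance for OLS."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    # Assumption: the model includes an intercept, so prepend a column of ones.
    Xd = np.column_stack([np.ones(n), X])
    p = Xd.shape[1]
    # Thin QR yields the hat-matrix diagonal without forming the n x n matrix,
    # which matters when n is large.
    Q, R = np.linalg.qr(Xd)
    leverage = np.sum(Q**2, axis=1)
    beta = np.linalg.solve(R, Q.T @ y)
    resid = y - Xd @ beta
    s2 = resid @ resid / (n - p)  # residual variance estimate
    student = resid / np.sqrt(s2 * (1.0 - leverage))
    cooks = student**2 * leverage / (p * (1.0 - leverage))
    return leverage, student, cooks

# Synthetic data (illustrative only): a clean linear trend plus one planted outlier.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)
x[0], y[0] = 30.0, 2.0  # high-leverage point far from the trend

leverage, student, cooks = classical_diagnostics(x.reshape(-1, 1), y)
print("index with largest Cook's distance:", int(np.argmax(cooks)))
```

With a single planted outlier these measures behave well. Because each one deletes or assesses only one case at a time, a cluster of several similar outliers can pull the fit toward itself and hide (masking) or make clean points look suspicious (swamping), which is the limitation the abstract refers to.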
