Abstract

The identification of atypical observations and the immunization of data analysis against both outliers and failures of modeling are important aspects of modern statistics. The forward search is a graphics-rich approach that leads to the formal detection of outliers and to the detection of model inadequacy, combined with suggestions for model enhancement. The key idea is to monitor quantities of interest, such as parameter estimates and test statistics, as the model is fitted to data subsets of increasing size. In this paper we propose some computational improvements to the forward search algorithm and provide a recursive implementation of the procedure which exploits the information from the previous step. The output is a set of efficient routines for fast updating of the model parameter estimates, which do not require any data sorting, and for fast computation of likelihood contributions, which do not require matrix inversion or QR decomposition. It is shown that the new algorithms reduce the computation time by more than 80%. Furthermore, the running time now increases almost linearly with the sample size. All the routines described in this paper are included in the FSDA toolbox for MATLAB, which is freely downloadable from the internet.

Highlights

  • The forward search is a powerful general method for detecting anomalies in structured data (Atkinson and Riani 2000; Atkinson, Riani, and Cerioli 2004; Riani, Atkinson, and Cerioli 2009; Atkinson, Riani, and Cerioli 2010), which relies on a simple and attractive idea

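  • We provide the user with: 1. An analysis of the computation time required to perform the forward search for a wide set of sample sizes; 2. Efficient algorithms for both subset updating and computation of Mahalanobis distances or residuals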

  • All the routines described in this paper are included in the FSDA toolbox for MATLAB (Riani, Perrotta, and Torti 2012)

Summary

Introduction

The forward search is a powerful general method for detecting anomalies in structured data (Atkinson and Riani 2000; Atkinson, Riani, and Cerioli 2004; Riani, Atkinson, and Cerioli 2009; Atkinson, Riani, and Cerioli 2010), which relies on a simple and attractive idea. Recent applications of the forward search include systematic outlier detection in official Census data (Torti, Perrotta, Francescangeli, and Bianchi 2015) and the analysis of international trade markets (Cerioli and Perrotta 2014), where important issues such as incorrect declarations, tax evasion and money laundering are at the forefront. In both these instances the number of datasets to be analyzed is of the order of hundreds of thousands, while the sample size of each dataset ranges from fewer than 10 observations to more than 100,000. We provide the user with: 1. An analysis of the computation time required to perform the forward search for a wide set of sample sizes; 2. Efficient algorithms for both subset updating and computation of Mahalanobis distances or residuals; these algorithms do not need matrix inversion, only simple matrix multiplications, and avoid the use of sorting procedures. Appendix D deals with the HTML documentation of the new functions which have been written.
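To make the "no matrix inversion" point concrete, the following is a minimal sketch of how an inverse cross-product matrix can be refreshed when one observation enters the fitting subset. It uses the standard Sherman-Morrison identity and is offered only as an illustration of the general idea; it is not taken from the paper or from the FSDA toolbox, and the function names `mat_vec` and `sherman_morrison_add` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def mat_vec(a: Matrix, v: Vector) -> Vector:
    """Return the matrix-vector product a @ v."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def sherman_morrison_add(inv_xtx: Matrix, x: Vector) -> Matrix:
    """Update (X'X)^{-1} after appending the row x to X, without re-inverting.

    Assumes inv_xtx is symmetric and applies the Sherman-Morrison identity
    (A + x x')^{-1} = A^{-1} - (A^{-1} x)(A^{-1} x)' / (1 + x' A^{-1} x).
    Only products and one scalar division are needed.
    """
    u = mat_vec(inv_xtx, x)  # A^{-1} x
    denom = 1.0 + sum(x_i * u_i for x_i, u_i in zip(x, u))
    p = len(x)
    return [
        [inv_xtx[i][j] - u[i] * u[j] / denom for j in range(p)]
        for i in range(p)
    ]


# Illustrative data (an assumption, not from the paper): a regression with
# intercept whose current subset has rows (1, 0), (1, 1), (1, 2), so that
# X'X = [[3, 3], [3, 5]] and its inverse is the matrix below.
inv_subset = [[5.0 / 6.0, -0.5], [-0.5, 0.5]]

# The observation (1, 3) joins the subset; update the inverse in place of
# inverting the new 2 x 2 cross-product matrix from scratch.
inv_updated = sherman_morrison_add(inv_subset, [1.0, 3.0])
print(inv_updated)
```

The updated matrix agrees, up to floating-point rounding, with the direct inverse of the enlarged cross-product matrix [[4, 6], [6, 14]]. The same identity with the sign reversed handles an observation leaving the subset, which is one way a recursive scheme can reuse the previous step.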

Fast subset updating
Fast deviance measures updating
Conclusions
Procedure to find the kth order statistic
Findings
HTML help files
