FsdaSAS: A Package for Robust Regression for Very Large Datasets Including the Batch Forward Search

Francesca Torti, Aldo Corbellini, Anthony C Atkinson

doi:10.3390/stats4020022

Open Access

https://doi.org/10.3390/stats4020022

Abstract

The forward search (FS) is a general method of robust data fitting that moves smoothly from very robust to maximum likelihood estimation. The regression procedures are included in the MATLAB toolbox FSDA. The work on a SAS version of the FS originates from the need for the analysis of large datasets expressed by law enforcement services operating in the European Union that use our SAS software for detecting data anomalies that may point to fraudulent customs returns. Specific to our SAS implementation, the fsdaSAS package, we describe the approximation used to provide fast analyses of large datasets using an FS which progresses through the inclusion of batches of observations, rather than progressing one observation at a time. We do, however, test for outliers one observation at a time. We demonstrate that our SAS implementation becomes appreciably faster than the MATLAB version as the sample size increases and is also able to analyse larger datasets. The series of fits provided by the FS leads to the adaptive data-dependent choice of maximally efficient robust estimates. This also allows the monitoring of residuals and parameter estimates for fits of differing robustness levels. We mention that our fsdaSAS also applies the idea of monitoring to several robust estimators for regression for a range of values of breakdown point or nominal efficiency, leading to adaptive values for these parameters. We have also provided a variety of plots linked through brushing. Further programmed analyses include the robust transformations of the response in regression. Our package also provides the SAS community with methods of monitoring robust estimators for multivariate data, including multivariate data transformations.

Highlights

Data frequently contain outlying observations, which need to be recognised and perhaps modelled
Data movements are drastically limited and the processing time is reduced: a data step reads into memory only the data that it needs and leaves the rest in the source; data are indexed to allow for faster retrieval from datasets; and, in regression and other predictive modelling methods, multi-threading is applied whenever this is appropriate for the analysis
We have demonstrated that the value k = 10 has a negligible effect on the results of statistical analyses for sample sizes where the batch procedure yields a significant reduction in computational time
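
The batch idea described in the abstract can be illustrated with a minimal sketch. It is not the fsdaSAS implementation: it assumes that k denotes the number of observations added to the fitting subset at each step, it uses simple regression on one explanatory variable, and it starts from a crude approximation to a least median of squares fit obtained from random pairs of points. The function names (`fit_line`, `robust_start`, `batch_forward_search`, `make_demo_data`) and the simulated data are hypothetical, and the one-at-a-time outlier tests mentioned in the abstract are omitted.

```python
import random


def fit_line(points):
    """Least squares fit of y = a + b*x to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def squared_residual(point, a, b):
    """Squared residual of one (x, y) point from the line a + b*x."""
    x, y = point
    return (y - a - b * x) ** 2


def robust_start(data, n_trials, rng):
    """Crude robust start (an assumption of this sketch): among lines through
    random pairs of points, keep the one with the smallest median squared residual."""
    best = None
    for _ in range(n_trials):
        p, q = rng.sample(data, 2)
        if p[0] == q[0]:
            continue  # a vertical line cannot be written as y = a + b*x
        a, b = fit_line([p, q])
        ranked = sorted(squared_residual(pt, a, b) for pt in data)
        median = ranked[len(ranked) // 2]
        if best is None or median < best[0]:
            best = (median, a, b)
    return best[1], best[2]


def batch_forward_search(data, m0, k, n_trials=200, seed=0):
    """Grow the fitting subset from m0 observations in batches of k.

    Returns a list of (subset size, intercept, slope), one entry per step.
    With k = 1 the subset grows one observation at a time.
    """
    rng = random.Random(seed)
    a, b = robust_start(data, n_trials, rng)
    n = len(data)
    m = m0
    path = []
    while True:
        # Order all observations by squared residual from the current fit,
        # then refit to the m observations closest to that fit.
        ranked = sorted(data, key=lambda pt: squared_residual(pt, a, b))
        a, b = fit_line(ranked[:m])
        path.append((m, a, b))
        if m == n:
            return path  # the last step is least squares on all the data
        m = min(m + k, n)


def make_demo_data(seed=1):
    """Simulated data: y = 1 + 2x + noise, with the last 20 responses shifted by 10."""
    rng = random.Random(seed)
    data = []
    for i in range(200):
        x = i / 10
        y = 1 + 2 * x + rng.gauss(0, 0.1)
        if i >= 180:
            y += 10  # contaminate the last 20 observations
        data.append((x, y))
    return data


data = make_demo_data()
for m, a, b in batch_forward_search(data, m0=20, k=10):
    print(m, round(a, 3), round(b, 3))
```

In this sketch the printed estimates form the kind of series that monitoring relies on: the fit for each subset size can be inspected as the search moves from the robust start towards least squares on all the data. Calling `batch_forward_search(data, m0=20, k=1)` gives the one-at-a-time version for comparison.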

Summary

Introduction

Data frequently contain outlying observations, which need to be recognised and perhaps modelled. A SAS version of the program, fsdaSAS (https://github.com/UniprJRC/FSDAsas, accessed on 12 March 2021), takes advantage of the file handling capabilities of SAS to increase the size of datasets that can be analysed and to decrease computation time for large problems. The purposes of the present paper are to introduce a set of SAS programs for robust data analysis, to provide a description of the batch forward search and to illustrate its properties on a previously unanalysed large data example. The paper is complemented by a software survey for robust statistical analyses with our fsdaSAS package.

Algebra for the Forward Search

Why SAS?

FS Analysis of the Transformed Loyalty Card Data

Loyalty card data: monitoring plots for the transformed data

The FS Batch Procedure

Timing Comparisons

Balance Sheet Data—A Large dataset

Discussion and Extensions

Three Classes of Estimator for Robust Regression

Monitoring and Graphics

Findings

Programs

Journal: Stats	Publication Date: Apr 18, 2021
Citations: 6	License type: CC BY 4.0
