Challenges of cellwise outliers

Abstract

It is well known that real data often contain outliers. The term outlier usually refers to a case, corresponding to a row of the n×d data matrix. In recent times a different type has come into focus, the cellwise outliers. These are suspicious cells (entries) that can occur anywhere in the data matrix. Even a relatively small proportion of outlying cells can contaminate over half the cases, which is a problem for robust methods. This article discusses the challenges posed by cellwise outliers, and some methods developed so far to deal with them. New results are obtained on cellwise breakdown values for location, covariance and regression. A cellwise robust method is proposed for correspondence analysis, with real data illustrations. The paper concludes by formulating some points for debate.
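
The statement that a small proportion of outlying cells can contaminate over half the cases can be checked with simple arithmetic. The sketch below assumes that each cell is contaminated independently with probability eps (an assumption made here for illustration, not a claim taken from the abstract); a row of d cells then contains at least one outlying cell with probability 1 - (1 - eps)^d. The function name is hypothetical.

```python
def contaminated_row_fraction(eps: float, d: int) -> float:
    """Expected fraction of rows containing at least one outlying cell,
    assuming each cell is contaminated independently with probability eps."""
    return 1.0 - (1.0 - eps) ** d


# Illustration: 5% outlying cells in data matrices of increasing dimension d.
for d in (5, 14, 50, 100):
    print(d, round(contaminated_row_fraction(0.05, d), 3))
```

Under this assumption, with 5% outlying cells the expected fraction of contaminated rows first exceeds one half at d = 14, since 0.95^14 is about 0.488.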

Similar Papers
  • Research Article
  • Citations: 2
  • 10.15672/hujms.734212
Robust regression estimation and variable selection when cellwise and casewise outliers are present
  • Feb 4, 2021
  • Hacettepe Journal of Mathematics and Statistics
  • Onur Toka + 2 more

Two main issues in regression analysis are estimation and variable selection in the presence of outliers. Popular robust regression estimation methods are combined with variable selection methods to achieve robust estimation and variable selection simultaneously. However, recent work has shown that the robust estimation methods used in those procedures are resistant only to casewise (rowwise) outliers. Since these robust variable selection methods may therefore not cope with cellwise outliers, extra care should be taken when cellwise outliers are present along with casewise outliers. In this study, we propose a robust estimation and variable selection method to deal with both cellwise and casewise outliers in the data. The proposed method has three steps. In the first step, cellwise outliers are identified, deleted and marked as NA in each explanatory variable. In the second step, the cells marked NA are imputed using a robust imputation method. In the last step, robust regression estimation methods are combined with the variable selection method LASSO (Least Absolute Shrinkage and Selection Operator) to estimate the regression parameters and to select the important explanatory variables. The simulation results and real data example show that the proposed estimation and variable selection procedure performs well in the presence of cellwise and casewise outliers.
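
The first two steps of such a procedure can be sketched for a single explanatory variable as follows. The abstract does not say which detection rule or imputation method the authors use, so this sketch swaps in simple stand-ins: a median/MAD robust z-score with a cutoff of 3 for flagging, and median imputation. All function names and data values are hypothetical, `None` plays the role of NA, and the third step (robust regression combined with LASSO) is not shown.

```python
import statistics


def flag_cells(column, cutoff=3.0):
    """Step 1 (stand-in rule): replace cells with a large robust z-score by None."""
    med = statistics.median(column)
    # MAD scaled by 1.4826 to be consistent with the standard deviation
    # at the normal distribution.
    mad = 1.4826 * statistics.median([abs(x - med) for x in column])
    if mad == 0:
        return list(column)  # no spread: flag nothing
    return [x if abs(x - med) / mad <= cutoff else None for x in column]


def impute_cells(column):
    """Step 2 (stand-in rule): fill None cells with the median of the kept cells."""
    kept = [x for x in column if x is not None]
    fill = statistics.median(kept)
    return [fill if x is None else x for x in column]


# Hypothetical explanatory variable with one outlying cell (25.0).
x = [1.0, 1.2, 0.9, 1.1, 25.0, 1.0, 0.8]
flagged = flag_cells(x)
cleaned = impute_cells(flagged)
print(flagged)
print(cleaned)
```

Flagging each variable separately, as done here, ignores the correlations between variables; it is meant only to make the sequence of steps concrete.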

  • Research Article
  • Citations: 4
  • 10.1007/s11634-021-00436-9
Robust regression with compositional covariates including cellwise outliers
  • Feb 24, 2021
  • Advances in Data Analysis and Classification
  • Nikola Štefelová + 4 more

We propose a robust procedure to estimate a linear regression model with compositional and real-valued explanatory variables. The proposed procedure is designed to be robust against individual outlying cells in the data matrix (cellwise outliers), as well as entire outlying observations (rowwise outliers). Cellwise outliers are first filtered and then imputed by robust estimates. Afterwards, rowwise robust compositional regression is performed to obtain model coefficient estimates. Simulations show that the procedure generally outperforms a traditional rowwise-only robust regression method (MM-estimator). Moreover, our procedure yields better or comparable results to recently proposed cellwise robust regression methods (shooting S-estimator, 3-step regression) while it is preferable for interpretation through the use of appropriate coordinate systems for compositional data. An application to bio-environmental data reveals that the proposed procedure—compared to other regression methods—leads to conclusions that are best aligned with established scientific knowledge.

  • Research Article
  • Citations: 11
  • 10.52933/jdssv.v1i3.18
Handling Cellwise Outliers by Sparse Regression and Robust Covariance
  • Dec 3, 2021
  • Journal of Data Science, Statistics, and Visualisation
  • Jakob Raymaekers + 1 more

We propose a data-analytic method for detecting cellwise outliers. Given a robust covariance matrix, outlying cells (entries) in a row are found by the cellFlagger technique which combines lasso regression with a stepwise application of constructed cutoff values. The penalty term of the lasso has a physical interpretation as the total distance that suspicious cells need to move in order to bring their row into the fold. For estimating a cellwise robust covariance matrix we construct a detection-imputation method which alternates between flagging outlying cells and updating the covariance matrix as in the EM algorithm. The proposed methods are illustrated by simulations and on real data about volatile organic compounds in children.

  • Research Article
  • Citations: 22
  • 10.1016/j.csda.2016.01.004
Robust regression estimation and inference in the presence of cellwise and casewise contamination
  • Jan 19, 2016
  • Computational Statistics & Data Analysis
  • Andy Leung + 2 more

  • Abstract
  • 10.1080/00273170802640533
Abstract: Local Influence and Robust Methods for Mediation Models
  • Dec 19, 2008
  • Multivariate Behavioral Research
  • Jiyun Zu + 1 more

Mediation analysis investigates how certain variables mediate the effect of predictors on outcome variables. Existing studies of mediation models have been limited to normal theory maximum likelihood (ML) or least squares with normally distributed data. Because real data in the social and behavioral sciences are seldom normally distributed and often contain outliers, classical methods can result in biased and inefficient estimates, which lead to inaccurate or unreliable tests of the mediated effect. The authors propose two approaches for better mediation analysis. One is to identify cases that strongly affect test results of mediation using local influence methods and robust methods. The other is to use robust methods for parameter estimation, and then test the mediated effect based on the robust estimates. Analytic details of both local influence and robust methods particular to mediation models are provided, and one real data example is given. We first used local influence and robust methods to identify influential cases. Then, for the original data and the data with the identified influential cases removed, the mediated effect was tested using two estimation methods, normal theory ML and the robust method, crossed with two versions of the Sobel (1982) test: one using the information-based standard error (z_I) and one using the sandwich-type standard error (z_SW). Results show that local influence and robust methods rank the influence of cases similarly, while the robust method is more objective. The widely used z_I statistic is inflated when the distribution is heavy-tailed. Compared to normal theory ML, the robust method provides estimates with smaller standard errors and more reliable tests.
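
For reference, the Sobel test named in this abstract divides the estimated mediated effect a*b by a first-order standard error. A minimal sketch follows; the function name and the numbers are hypothetical, and the standard errors are taken as inputs, so either information-based or sandwich-type estimates could be plugged in.

```python
import math


def sobel_z(a: float, se_a: float, b: float, se_b: float) -> float:
    """Sobel z statistic for the mediated effect a*b.

    a, se_a: estimated path from predictor to mediator and its standard error
    b, se_b: estimated path from mediator to outcome and its standard error
    """
    se_ab = math.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
    return (a * b) / se_ab


# Hypothetical estimates, for illustration only.
z = sobel_z(a=0.5, se_a=0.1, b=0.4, se_b=0.1)
print(round(z, 3))
```

Because the statistic depends on the standard errors only through se_a and se_b, the choice between the two standard-error estimators changes the test while leaving the point estimate a*b untouched.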

  • Research Article
  • Citations: 24
  • 10.1016/j.csda.2017.02.007
Multivariate location and scatter matrix estimation under cellwise and casewise contamination
  • Feb 15, 2017
  • Computational Statistics & Data Analysis
  • Andy Leung + 2 more

  • Research Article
  • Citations: 5
  • 10.1142/s0218001403002861
ROBUST CORRESPONDENCE METHODS FOR STEREO VISION
  • Nov 1, 2003
  • International Journal of Pattern Recognition and Artificial Intelligence
  • Matthew P Eklund + 2 more

Correspondence is one of the major problems that must be solved in stereo vision. Correlation has been commonly used in the past for this problem. However, most classical linear correlation methods fail near depth discontinuities and in the presence of occlusions. Many robust methods have been proposed that claim to deal effectively with some or all of these issues. Many of these robust methods are transformation-based; others are non-transformation-based. This paper gives five requirements that should be met by a transformation-based robust correlation method. We compare some of the robust correspondence methods and demonstrate their utility on different data sets. Based on these results, we propose a solution to the correspondence problem which represents a compromise between the speed of classical correlation and the improved results obtained from a more robust correspondence method. Also, we propose a median filtering technique that removes noise from the disparity maps while preserving certain image features usually removed by ordinary median filtering.

  • Research Article
  • Citations: 2
  • 10.1080/03610918.2019.1659968
Combining empirical likelihood and robust estimation methods for linear regression models
  • Sep 5, 2019
  • Communications in Statistics - Simulation and Computation
  • Şenay Özdemir + 1 more

Ordinary least squares (OLS) and robust methods are used for estimating the parameters of a linear regression model. These methods perform well under some distributional assumptions which may not be appropriate for some data sets. Therefore, nonparametric methods like empirical likelihood (EL) may be considered. The EL method maximizes an EL function under some constraints. We consider the EL method with constraints robustified by the M estimation method. We provide a small simulation study and a real data example to demonstrate the capability of the robust EL method, and the results reveal that robust constraints are needed when outliers are present in the data.

  • Research Article
  • Citations: 18
  • 10.1007/s11004-020-09861-6
Multivariate Outlier Detection in Applied Data Analysis: Global, Local, Compositional and Cellwise Outliers
  • Apr 2, 2020
  • Mathematical Geosciences
  • Peter Filzmoser + 1 more

Outliers are encountered in all practical situations of data analysis, regardless of the discipline of application. However, the term outlier is not uniformly defined across all these fields since the differentiation between regular and irregular behaviour is naturally embedded in the subject area under consideration. Generalized approaches for outlier identification have to be modified to allow the diligent search for potential outliers. Therefore, an overview of different techniques for multivariate outlier detection is presented within the scope of selected kinds of data frequently found in the field of geosciences. In particular, three common types of data in geological studies are explored: spatial, compositional and flat data. All of these formats motivate new outlier concepts, such as local outlyingness, where the spatial information of the data is used to define a neighbourhood structure. Another type are compositional data, which nicely illustrate the fact that some kinds of data require not only adaptations to standard outlier approaches, but also transformations of the data itself before conducting the outlier search. Finally, the very recently developed concept of cellwise outlyingness, typically used for high-dimensional data, allows one to identify atypical cells in a data matrix. In practice, the different data formats can be mixed, and it is demonstrated in various examples how to proceed in such situations.

  • Research Article
  • Citations: 1
  • 10.19113/sdufenbed.1141519
Vulnerability of the Tukey M Robust Regression Method Against Multicollinearity
  • Apr 25, 2023
  • Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi
  • Filiz Karadağ + 1 more

In this study, we investigate whether the Tukey M robust regression method provides a solution for data sets suffering from multicollinearity. It is observed that high values of the variance inflation factors (VIF), which are a sign of multiple linear dependence among the explanatory variables, cannot be controlled by robust methods that work through the residuals. The reason is that multicollinearity, and the high VIF values that result from it, do not produce extreme residuals. For this reason, robust methods cannot provide a solution for the high VIF problem. This is shown by an extensive simulation study, in which the explanatory variables were drawn from a trivariate normal distribution for three different correlation values. We also used two real-life data examples and observed that the results support the findings of the simulation study. For all these reasons, we conclude that specialized methods should be used in the case of multicollinearity.
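
The point that VIF values are driven by the explanatory variables alone can be illustrated with the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 is the coefficient of determination from regressing explanatory variable j on the others. The sketch below assumes three explanatory variables with a common pairwise correlation rho, in which case R_j^2 = 2 rho^2 / (1 + rho). This equicorrelated setup is an assumption made here for illustration; the abstract does not state the correlation structure or values used in the simulation.

```python
def vif_equicorrelated(rho: float) -> float:
    """Population VIF of each of three predictors with common correlation rho.

    Regressing one predictor on the other two gives R^2 = 2*rho^2 / (1 + rho),
    and VIF = 1 / (1 - R^2). Valid for -0.5 < rho < 1.
    """
    r_squared = 2.0 * rho**2 / (1.0 + rho)
    return 1.0 / (1.0 - r_squared)


# Hypothetical correlation values, for illustration only.
for rho in (0.0, 0.5, 0.9, 0.99):
    print(rho, round(vif_equicorrelated(rho), 2))
```

The response variable and the residuals do not enter this calculation at all, which is consistent with the abstract's observation that residual-based robust methods cannot reduce high VIF values.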

  • Research Article
  • Citations: 14
  • 10.1111/bmsp.12230
An overview of applied robust methods.
  • Jan 29, 2021
  • British Journal of Mathematical and Statistical Psychology
  • Ke‐Hai Yuan + 1 more

Data in social sciences are typically non-normally distributed and characterized by heavy tails. However, most widely used methods in social sciences are still based on the analyses of sample means and sample covariances. While these conventional methods continue to be used to address new substantive issues, conclusions reached can be inaccurate or misleading. Although there is no 'best method' in practice, robust methods that consider the distribution of the data can perform substantially better than the conventional methods. This article gives an overview of robust procedures, emphasizing a few that have been repeatedly shown to work well for models that are widely used in social and behavioural sciences. Real data examples show how to use the robust methods for latent variable models and for moderated mediation analysis when a regression model contains categorical covariates and product terms. Results and logical analyses indicate that robust methods yield more efficient parameter estimates, more reliable model evaluation, more reliable model/data diagnostics, and more trustworthy conclusions when conducting replication studies. R and SAS programs are provided for routine applications of the recommended robust method.

  • Research Article
  • Citations: 4
  • 10.1007/s11095-021-03110-z
A robust method for the assessment of average bioequivalence in the presence of outliers and skewness.
  • Oct 1, 2021
  • Pharmaceutical Research
  • Divan Aristo Burger + 2 more

In this paper, we propose a robust Bayesian method for the assessment of average bioequivalence based on data from conventional crossover studies. We evaluate and motivate empirically the need for robust methods in bioequivalence studies by comparing the results of robust and conventional statistical methods in a large data pool of bioequivalence studies. Robustness of the statistical methodology is achieved by replacing the normal distributions for residuals in the linear mixed model with skew-t distributions. In this way, the statistical model can accommodate skew and heavy-tailed data, particularly outliers, yielding robust statistical inference without the need for excluding outliers from the analysis. We performed a simulation study to investigate and compare the performance of the robust and conventional models. Our study shows that in some trials, the distribution of residuals is skew and heavy-tailed. In the presence of outliers, the 90% confidence intervals for the ratio of geometric means tend to be narrower for the robust methods than for the conventional method. Our simulation study shows that the robust method has suitable frequentist properties and yields more precise confidence intervals and higher statistical power than the conventional maximum likelihood method when outliers are present in the data. As a sensitivity analysis, we recommend the fit of robust models for handling outliers that are occasionally encountered in crossover design bioequivalence data.

  • Research Article
  • Citations: 7
  • 10.1002/cem.3182
Cellwise outlier detection and biomarker identification in metabolomics based on pairwise log ratios
  • Dec 2, 2019
  • Journal of Chemometrics
  • Jan Walach + 4 more

Data outliers can carry very valuable information and might be most informative for the interpretation. Nevertheless, they are often neglected. An algorithm called cellwise outlier diagnostics using robust pairwise log ratios (cell‐rPLR) is proposed for the identification of outliers in single cells of a data matrix. The algorithm is designed for metabolomic data, where due to the size effect, the measured values are not directly comparable. Pairwise log ratios between the variable values form the elemental information for the algorithm, and the aggregation of appropriate outlyingness values results in outlyingness information. A further feature of cell‐rPLR is that it is useful for biomarker identification, particularly in the presence of cellwise outliers. Real data examples and simulation studies underline the good performance of this algorithm in comparison with alternative methods.

  • Research Article
  • 10.1002/cjs.11649
Cellwise outlier detection with false discovery rate control
  • Aug 14, 2021
  • Canadian Journal of Statistics
  • Yanhong Liu + 4 more

This article is concerned with detecting cellwise outliers in large data matrices. We introduce a novel method that is able to fully exploit dependence structures among variables while controlling the false discovery rate (FDR). We reframe cellwise outlier identification into a high‐dimensional variable selection paradigm and construct “binate references” for data screening, estimation and information pooling. With the binate references, the proposed procedure forms a series of statistics that incorporate covariance information and utilizes a global symmetry property of these statistics to approximate the false discovery proportion. We show that the proposed method can control the asymptotic FDR under some mild conditions. Extensive numerical studies demonstrate that our method has reasonable FDR control and satisfactory power in comparison to existing methods.

  • Research Article
  • Citations: 1
  • 10.35378/gujs.642935
Adaptive Reweighted Minimum Vector Variance Estimator of Covariance Used for as a New Robust Approach to Partial Least Squares Regression
  • Dec 1, 2020
  • Gazi University Journal of Science
  • Esra Polat + 1 more

Partial Least Squares Regression (PLSR) is a linear regression method developed as a partial type of the least squares estimator for regression when multicollinearity exists among the independent variables. If there are outliers in the data set, robust methods can be applied to diminish or remove their negative impact. Past studies have shown that if the covariance matrix is appropriately robustified, the PLS1 algorithm (PLSR for one dependent variable) becomes robust against outliers. In this study, an adaptive reweighted estimator of covariance, with Minimum Vector Variance as the initial estimator, is used and a new robust PLSR method, “PLS-ARWMVV”, is introduced. PLS-ARWMVV is compared with ordinary PLSR and four popular robust PLSR methods. The simulation and real data application reveal that, when there are contaminated observations, the proposed PLS-ARWMVV is robust and efficient. It generally performs better than robust PRM and is a good alternative to the other robust methods PLS-KurSD, RSIMPLS and PLS-SD.
