Articles published on Cell-wise Outliers
- Research Article
- 10.1177/07591063251348789
- Jul 21, 2025
- Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique
- Qianqian Qi + 3 more
Correspondence analysis: Handling cell-wise outliers via the reconstitution algorithm
- Research Article
- 10.3329/dujs.v73i2.82773
- Jul 12, 2025
- Dhaka University Journal of Science
- Nadia Mehjabeen Oyshi + 2 more
The proliferation of high-dimensional data has heightened challenges posed by cellwise outliers, where contamination in individual cells distorts analyses more pervasively than traditional rowwise outliers. This study conducts a comprehensive comparison of robust variable selection methods under cellwise contamination, evaluating four rank-based techniques (ALGR, ALRP, LGR, LRP) against traditional approaches (Lasso, Adaptive Lasso, sLTS). Simulations under varying correlation structures, contamination rates (2%, 5%, 10%), and outlier magnitudes (γ = 2, 6, 10) demonstrate that Gaussian Rank correlation-based methods (ALGR, LGR) achieve superior F1 scores, balancing high true positives and low false positives. Real-data applications on life expectancy and crime datasets corroborate these findings, with ALGR and LGR maintaining robustness in low- and high-dimensional settings. Results emphasize the critical need for methods resilient to cellwise contamination in fields reliant on accurate high-dimensional data analysis, such as healthcare and genomics. Dhaka Univ. J. Sci. 73(2): 143-150, 2025 (July)
- Research Article
- 10.37394/23202.2024.23.34
- Nov 25, 2024
- WSEAS TRANSACTIONS ON SYSTEMS
- Anabela Rocha + 2 more
Real-world data often violate the conditions assumed by classical estimation methods. One reason for this failure may be the presence of observations with a low probability of belonging to the same distribution as the majority of the data, known as outliers. Outliers can appear in different forms, such as casewise and cellwise outliers. The results of classical estimation methods, particularly those based on least squares, can be seriously affected by the presence of any type of outlier. Panel data modeling is applied in various fields, including economics, finance, marketing, biology, environmental studies, healthcare, and more. The estimation of these models is typically performed using classical methods. In this paper, we consider the random effects panel data model and propose a robust method to estimate the parameters of this model. To evaluate the performance of the proposed robust estimation method compared to the classical estimation method, we conducted a Monte Carlo simulation study. Additionally, we illustrate the proposed methodology by applying it to estimate a model based on a real panel data set.
- Research Article
1
- 10.3390/stats7040073
- Oct 19, 2024
- Stats
- Luca Sartore + 2 more
Outliers are typically identified using frequentist methods. The data are classified as “outliers” or “not outliers” based on a test statistic that measures the magnitude of the difference between a value and the majority part of the data. The threshold for a data value to be an outlier is typically defined by the user. However, a subjective choice of the threshold increases the uncertainty associated with outlier status for each data value. A cellwise outlier detection algorithm named FuzzyHRT is used to automate the editing process in repeated surveys. This algorithm uses Bienaymé–Chebyshev’s inequality and fuzzy logic to detect four different types of outliers resulting from format inconsistencies, historical, tail, and relational anomalies. However, fuzzy logic is not suited for probabilistic reasoning behind the identification of anomalous cells. Bayesian methods are well suited for quantifying the uncertainty associated with the identification of outliers. Although, as suggested by the literature, there exist well-developed Bayesian methods for record-level outlier detection, Bayesian methods for identifying outliers within individual records (i.e., at the cell level) remain unexplored. This paper presents two approaches from the Bayesian perspective to study the uncertainty associated with identifying outliers. A Bayesian bootstrap approach is explored to study the uncertainty associated with the output scores from the FuzzyHRT algorithm. Empirical likelihoods in a Bayesian setting are also considered for probabilistic reasoning behind the identification of anomalous cells. NASS survey data for livestock and major crop yield (such as corn) are considered for comparing the performances of the two proposed approaches with recent cellwise outlier methods.
- Research Article
- 10.1016/j.chemolab.2024.105170
- Jul 1, 2024
- Chemometrics and Intelligent Laboratory Systems
- Mia Hubert + 1 more
MacroPARAFAC for handling rowwise and cellwise outliers in incomplete multiway data
- Research Article
3
- 10.1016/j.csda.2024.107971
- Apr 30, 2024
- Computational Statistics and Data Analysis
- Peng Su + 3 more
Cellwise contamination remains a challenging problem for data scientists, particularly in research fields that require the selection of sparse features. Traditional robust methods may be neither feasible nor efficient in dealing with such contaminated datasets. A robust Lasso-type cellwise regularization procedure, coined CR-Lasso, is proposed that performs feature selection in the presence of cellwise outliers by minimising a regression loss and a cell deviation measure simultaneously. The evaluation of this approach involves simulation studies that compare its selection and prediction performance with several sparse regression methods. The results demonstrate that CR-Lasso is competitive within the considered settings. The effectiveness of the proposed method is further illustrated through an analysis of a bone mineral density dataset.
- Research Article
3
- 10.1080/00949655.2023.2286316
- Feb 21, 2024
- Journal of Statistical Computation and Simulation
- Peng Su + 2 more
Cellwise outliers are widespread in real-world data analysis. Traditional robust methods may fail when applied to datasets under such contamination. We introduce a variable selection procedure that uses the Gaussian rank estimator to obtain an initial empirical covariance matrix among the response and potential predictors. We re-parameterize the classical linear regression model design matrix and the response vector such that we are able to take advantage of these robustly estimated components before applying the adaptive Lasso to obtain consistent variable selection results. The procedure is robust to cellwise outliers in low- and high-dimensional settings. Empirical results show good performance compared with recently proposed robust techniques, particularly in the challenging environment when contamination rates are high but the magnitude of outliers is moderate.
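As a rough illustration of the rank-based idea in the abstract above, the following sketch computes a Gaussian rank correlation: each variable is replaced by the normal scores of its ranks, and the ordinary Pearson correlation is taken on those scores. The function names, the `n + 1` rank scaling, and the toy data are assumptions made for illustration; ties are not handled, and this is not the authors' implementation.

```python
import math
from statistics import NormalDist


def normal_scores(x):
    """Map values to normal quantiles of their ranks (assumes no ties)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    nd = NormalDist()
    # Dividing by n + 1 keeps every probability strictly inside (0, 1).
    return [nd.inv_cdf(r / (n + 1)) for r in ranks]


def pearson(a, b):
    """Plain product-moment correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def gaussian_rank_corr(x, y):
    """Pearson correlation computed on the normal scores of the ranks."""
    return pearson(normal_scores(x), normal_scores(y))


# Toy data (hypothetical): a perfect linear relation with one corrupted cell.
x = [float(v) for v in range(1, 21)]
y = [2.0 * v for v in x]
y_out = list(y)
y_out[0] = 1e6  # a single outlying cell

print("Pearson, contaminated:      ", pearson(x, y_out))
print("Gaussian rank, contaminated:", gaussian_rank_corr(x, y_out))
```

Because only ranks enter the computation, the size of the corrupted value does not matter; it affects the result only through the rank it occupies.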
- Research Article
12
- 10.1016/j.ecosta.2024.02.002
- Feb 1, 2024
- Econometrics and Statistics
- Jakob Raymaekers + 1 more
It is well-known that real data often contain outliers. The term outlier usually refers to a case, usually denoted by a row of the n×d data matrix. In recent times a different type has come into focus, the cellwise outliers. These are suspicious cells (entries) that can occur anywhere in the data matrix. Even a relatively small proportion of outlying cells can contaminate over half the cases, which is a problem for robust methods. This article discusses the challenges posed by cellwise outliers, and some methods developed so far to deal with them. New results are obtained on cellwise breakdown values for location, covariance and regression. A cellwise robust method is proposed for correspondence analysis, with real data illustrations. The paper concludes by formulating some points for debate.
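The claim that a small proportion of outlying cells can contaminate over half the cases can be checked with a short calculation. The sketch below assumes that each cell is contaminated independently with the same probability `eps`, which is a simplifying assumption and not something stated in the abstract; under it, the expected fraction of rows containing at least one outlying cell is 1 - (1 - eps)^d for d columns.

```python
def contaminated_row_fraction(eps: float, d: int) -> float:
    """Expected fraction of rows with at least one outlying cell.

    Assumes every cell is contaminated independently with probability eps.
    """
    return 1.0 - (1.0 - eps) ** d


# With 5% contaminated cells, the share of affected rows grows quickly with d.
for d in (5, 20, 100):
    print(d, round(contaminated_row_fraction(0.05, d), 3))
```

Under this assumption, 5% contaminated cells already affect more than half of the rows once the data have 20 columns.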
- Research Article
1
- 10.1016/j.ecosta.2024.02.003
- Feb 1, 2024
- Econometrics and Statistics
- Claudio Agostinelli
Comments on “Challenges of cellwise outliers” by Jakob Raymaekers and Peter J. Rousseeuw
- Research Article
- 10.1016/j.gexplo.2023.107299
- Aug 24, 2023
- Journal of Geochemical Exploration
- Christopher Rieser + 2 more
Cell-wise outliers are outliers in single entries of a compositional data matrix, and they can lead to a certain bias in the statistical analysis. Traditional row-wise robust methods downweight outlying observations for the estimation, independent of how many or which cells of an observation are contaminated. Cell-wise robustness still makes use of the information contained in non-contaminated cells. Here, cell-wise robustness is used for the estimation of the variation and the covariance matrix. For higher dimensional data also a regularized estimator is introduced. The advantages of the cell-wise robust estimators are demonstrated in simulation experiments and in a geochemistry application in the context of clustering and principal component analysis.
- Research Article
11
- 10.52933/jdssv.v1i3.18
- Dec 3, 2021
- Journal of Data Science, Statistics, and Visualisation
- Jakob Raymaekers + 1 more
We propose a data-analytic method for detecting cellwise outliers. Given a robust covariance matrix, outlying cells (entries) in a row are found by the cellFlagger technique which combines lasso regression with a stepwise application of constructed cutoff values. The penalty term of the lasso has a physical interpretation as the total distance that suspicious cells need to move in order to bring their row into the fold. For estimating a cellwise robust covariance matrix we construct a detection-imputation method which alternates between flagging outlying cells and updating the covariance matrix as in the EM algorithm. The proposed methods are illustrated by simulations and on real data about volatile organic compounds in children.
- Research Article
1
- 10.13189/ms.2021.090505
- Sep 1, 2021
- Mathematics and Statistics
- Yik-Siong Pang + 2 more
Multivariate outliers can exist in two forms, casewise and cellwise. Collected data typically contain an unknown proportion and unknown types of outliers, which can jeopardize location estimation and affect research findings. In cases where the two types coexist in the same data set, the traditional distance-based trimmed mean and coordinate-wise trimmed mean are unable to estimate location well. The distance-based trimmed mean suffers from cellwise outliers left over after the trimming, whereas the coordinate-wise trimmed mean is affected by extra casewise outliers. Thus, this paper proposes a new robust multivariate location estimator, the α-distance-based trimmed median, to deal with both types of outliers simultaneously in a data set. Simulated data were used to illustrate the feasibility of the new procedure by comparing it with the classical mean, the classical median and the α-distance-based trimmed mean. The classical mean performed best on clean data, but not on contaminated data. Meanwhile, the classical median outperformed the distance-based trimmed mean when dealing with both casewise and cellwise outliers, but was still affected by the combined effect of the outliers. Based on the simulation results, the proposed α-distance-based trimmed median yields better location estimation on contaminated data than the other three estimators considered in this paper. Thus, the proposed estimator can mitigate the issues caused by outliers and provide a better location estimate.
- Research Article
- 10.1002/cjs.11649
- Aug 14, 2021
- Canadian Journal of Statistics
- Yanhong Liu + 4 more
This article is concerned with detecting cellwise outliers in large data matrices. We introduce a novel method that is able to fully exploit dependence structures among variables while controlling the false discovery rate (FDR). We reframe cellwise outlier identification into a high‐dimensional variable selection paradigm and construct “binate references” for data screening, estimation and information pooling. With the binate references, the proposed procedure forms a series of statistics that incorporate covariance information and utilizes a global symmetry property of these statistics to approximate the false discovery proportion. We show that the proposed method can control the asymptotic FDR under some mild conditions. Extensive numerical studies demonstrate that our method has reasonable FDR control and satisfactory power in comparison to existing methods.
- Research Article
4
- 10.1007/s11634-021-00436-9
- Feb 24, 2021
- Advances in Data Analysis and Classification
- Nikola Štefelová + 4 more
We propose a robust procedure to estimate a linear regression model with compositional and real-valued explanatory variables. The proposed procedure is designed to be robust against individual outlying cells in the data matrix (cellwise outliers), as well as entire outlying observations (rowwise outliers). Cellwise outliers are first filtered and then imputed by robust estimates. Afterwards, rowwise robust compositional regression is performed to obtain model coefficient estimates. Simulations show that the procedure generally outperforms a traditional rowwise-only robust regression method (MM-estimator). Moreover, our procedure yields better or comparable results to recently proposed cellwise robust regression methods (shooting S-estimator, 3-step regression) while it is preferable for interpretation through the use of appropriate coordinate systems for compositional data. An application to bio-environmental data reveals that the proposed procedure—compared to other regression methods—leads to conclusions that are best aligned with established scientific knowledge.
- Research Article
2
- 10.15672/hujms.734212
- Feb 4, 2021
- Hacettepe Journal of Mathematics and Statistics
- Onur Toka + 2 more
Two main issues regarding a regression analysis are estimation and variable selection in the presence of outliers. Popular robust regression estimation methods are combined with variable selection methods to simultaneously achieve robust estimation and variable selection. However, recent works showed that the robust estimation methods used in those estimation and variable selection procedures are only resistant to the casewise (rowwise) outliers in the data. Therefore, since these robust variable selection methods may not be able to cope with cellwise outliers in the data, some extra care should be taken when cellwise outliers are present along with the casewise outliers. In this study, we proposed a robust estimation and variable selection method to deal with both cellwise and casewise outliers in the data. The proposed method has three steps. In the first step, cellwise outliers were identified, deleted and marked with NA sign in each explanatory variable. In the second step, the cells with NA signs were imputed using a robust imputation method. In the last step, robust regression estimation methods were combined with the variable selection method LASSO (Least Absolute Shrinkage and Selection Operator) to estimate the regression parameters and to select remarkable explanatory variables. The simulation results and real data example revealed that the proposed estimation and variable selection procedure performs well in the presence of cellwise and casewise outliers.
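The first two steps described above (flag outlying cells as missing, then impute them) can be sketched as follows. The abstract does not say which detection or imputation rule is used, so this sketch swaps in simple stand-ins: a univariate median/MAD score with a cutoff of 3 for flagging, and the column median of the unflagged cells for imputation. The function name `flag_and_impute` and the toy data are hypothetical, and the third step (robust regression combined with LASSO) is not shown.

```python
from statistics import median


def flag_and_impute(rows, cutoff=3.0):
    """Flag cells with a large robust z-score, then impute them.

    Stand-in rules (assumptions, not the paper's method):
    - a cell is flagged if |value - median| / (1.4826 * MAD) > cutoff;
    - flagged cells are replaced by the median of the unflagged cells.
    Returns (cleaned_rows, flags).
    """
    n_cols = len(rows[0])
    cleaned = [list(r) for r in rows]
    flags = [[False] * n_cols for _ in rows]
    for j in range(n_cols):
        col = [r[j] for r in rows]
        med = median(col)
        mad = 1.4826 * median([abs(v - med) for v in col])
        if mad == 0:
            continue  # no spread in this column, so nothing can be flagged
        for i, v in enumerate(col):
            if abs(v - med) / mad > cutoff:
                flags[i][j] = True  # step 1: treat the cell as missing
        kept = [v for i, v in enumerate(col) if not flags[i][j]]
        fill = median(kept)
        for i in range(len(rows)):
            if flags[i][j]:
                cleaned[i][j] = fill  # step 2: impute the flagged cell
    return cleaned, flags


# Toy data (hypothetical): one corrupted cell in the second column.
data = [[1.0, 10.0], [2.0, 11.0], [3.0, 12.0], [4.0, 500.0], [5.0, 14.0]]
cleaned, flags = flag_and_impute(data)
print(cleaned[3], flags[3])
```

Only the corrupted cell is replaced; the clean value in the same row is kept, which is the practical difference from discarding or downweighting the entire row.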
- Research Article
9
- 10.1016/j.sigpro.2020.107608
- Apr 23, 2020
- Signal Processing
- Jasin Machkour + 3 more
A robust adaptive Lasso estimator for the independent contamination model
- Research Article
18
- 10.1007/s11004-020-09861-6
- Apr 2, 2020
- Mathematical Geosciences
- Peter Filzmoser + 1 more
Outliers are encountered in all practical situations of data analysis, regardless of the discipline of application. However, the term outlier is not uniformly defined across all these fields since the differentiation between regular and irregular behaviour is naturally embedded in the subject area under consideration. Generalized approaches for outlier identification have to be modified to allow the diligent search for potential outliers. Therefore, an overview of different techniques for multivariate outlier detection is presented within the scope of selected kinds of data frequently found in the field of geosciences. In particular, three common types of data in geological studies are explored: spatial, compositional and flat data. All of these formats motivate new outlier concepts, such as local outlyingness, where the spatial information of the data is used to define a neighbourhood structure. Another type are compositional data, which nicely illustrate the fact that some kinds of data require not only adaptations to standard outlier approaches, but also transformations of the data itself before conducting the outlier search. Finally, the very recently developed concept of cellwise outlyingness, typically used for high-dimensional data, allows one to identify atypical cells in a data matrix. In practice, the different data formats can be mixed, and it is demonstrated in various examples how to proceed in such situations.
- Research Article
23
- 10.1016/j.csda.2020.106944
- Mar 4, 2020
- Computational Statistics & Data Analysis
- P Filzmoser + 4 more
Cellwise robust M regression
- Research Article
7
- 10.1002/cem.3182
- Dec 2, 2019
- Journal of Chemometrics
- Jan Walach + 4 more
Data outliers can carry very valuable information and might be most informative for the interpretation. Nevertheless, they are often neglected. An algorithm called cellwise outlier diagnostics using robust pairwise log ratios (cell‐rPLR) for the identification of outliers in single cells of a data matrix is proposed. The algorithm is designed for metabolomic data, where due to the size effect, the measured values are not directly comparable. Pairwise log ratios between the variable values form the elemental information for the algorithm, and the aggregation of appropriate outlyingness values results in outlyingness information. A further feature of cell‐rPLR is that it is useful for biomarker identification, particularly in the presence of cellwise outliers. Real data examples and simulation studies underline the good performance of this algorithm in comparison with alternative methods.
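One reason pairwise log ratios suit data with a size effect is that they do not change when a whole sample is multiplied by a constant. The sketch below only demonstrates that invariance on made-up numbers; the function name is hypothetical, and this is not the cell-rPLR algorithm, whose robust aggregation step is not described in enough detail here to reproduce.

```python
import math


def pairwise_log_ratios(sample):
    """Return log(x_j / x_k) for every pair j < k of a positive-valued sample."""
    d = len(sample)
    return {
        (j, k): math.log(sample[j] / sample[k])
        for j in range(d)
        for k in range(j + 1, d)
    }


# Toy data (hypothetical): the second sample is the first one scaled by 3,
# mimicking a size effect such as a different dilution.
a = [2.0, 4.0, 8.0]
b = [3.0 * v for v in a]

ratios_a = pairwise_log_ratios(a)
ratios_b = pairwise_log_ratios(b)
print(all(math.isclose(ratios_a[p], ratios_b[p]) for p in ratios_a))
```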
- Research Article
46
- 10.1080/00401706.2019.1677270
- Nov 1, 2019
- Technometrics
- Jakob Raymaekers + 1 more
The product moment covariance matrix is a cornerstone of multivariate data analysis, from which one can derive correlations, principal components, Mahalanobis distances and many other results. Unfortunately, the product moment covariance and the corresponding Pearson correlation are very susceptible to outliers (anomalies) in the data. Several robust estimators of covariance matrices have been developed, but few are suitable for the ultrahigh-dimensional data that are becoming more prevalent nowadays. For that one needs methods whose computation scales well with the dimension, are guaranteed to yield a positive semidefinite matrix, and are sufficiently robust to outliers as well as sufficiently accurate in the statistical sense of low variability. We construct such methods using data transformations. The resulting approach is simple, fast, and widely applicable. We study its robustness by deriving influence functions and breakdown values, and computing the mean squared error on contaminated data. Using these results we select a method that performs well overall. This also allows us to construct a faster version of the DetectDeviatingCells method (Rousseeuw and Van den Bossche 2018) to detect cellwise outliers, which can deal with much higher dimensions. The approach is illustrated on genomic data with 12,600 variables and color video data with 920,000 dimensions. Supplementary materials for this article are available online.