Abstract
Multivariate outliers can exist in two forms, casewise and cellwise. Collected data typically contain an unknown proportion and unknown types of outliers, which can jeopardize location estimation and affect research findings. When the two types coexist in the same data set, the traditional distance-based trimmed mean and coordinate-wise trimmed mean are unable to estimate location well. The distance-based trimmed mean suffers from cellwise outliers left over after trimming, whereas the coordinate-wise trimmed mean is affected by extra casewise outliers. Thus, this paper proposes a new robust multivariate location estimator, the α-distance-based trimmed median, to deal with both types of outliers simultaneously in a data set. Simulated data were used to illustrate the feasibility of the new procedure by comparing it with the classical mean, the classical median and the α-distance-based trimmed mean. As expected, the classical mean performed best on clean data, but not on contaminated data. Meanwhile, the classical median outperformed the distance-based trimmed mean when dealing with both casewise and cellwise outliers, but was still affected by the combined effect of the outliers. Based on the simulation results, the proposed α-distance-based trimmed median yields better location estimates on contaminated data than the other three estimators considered in this paper. Thus, the proposed estimator can mitigate the issues caused by outliers and provide a better location estimate.
Highlights
The classical mean, with a 0% breakdown point, is highly sensitive to outliers: even a single outlier can divert the estimate from the true location and corrupt the least-squares-based variance [2,3,4,5,6]
Suppose that xij comes from d-dimensional feature vectors, where i = 1, ..., n indexes the sample vectors and j = 1, ..., d indexes the variables; the classical Mahalanobis Squared Distance (MSD) is obtained via Eq 1 [15]:
Note that when both casewise and cellwise outliers coexist in the data set, they should be treated separately
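The classical Mahalanobis squared distance mentioned in the highlights can be illustrated with a short sketch. Eq 1 itself is not reproduced on this page, so the code assumes the standard textbook form, MSD_i = (x_i - x̄)' S^{-1} (x_i - x̄), where x̄ is the sample mean vector and S is the sample covariance matrix; the paper's own notation may differ. The function name `mahalanobis_squared_distances` is hypothetical and chosen only for this illustration.

```python
import numpy as np


def mahalanobis_squared_distances(X):
    """Classical MSD of every row of X from the sample mean.

    Assumes the textbook definition with the sample mean and the
    sample covariance (divisor n - 1); this is an illustration, not
    necessarily the exact form of the paper's Eq 1.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)                 # d-vector of column means
    cov = np.cov(X, rowvar=False)         # d x d sample covariance
    centered = X - mean                   # n x d deviations from the mean
    # Solve cov @ z = centered.T rather than inverting cov explicitly.
    solved = np.linalg.solve(cov, centered.T).T
    # Row-wise dot product gives one squared distance per observation.
    return np.einsum("ij,ij->i", centered, solved)
```

Because both the mean and the covariance are classical (non-robust) estimates, these distances can themselves be distorted by outliers, which is consistent with the sensitivity of the classical mean noted above.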
Summary
Most parametric statistical tools were derived using the population parameters mean (μ) and (co)variance (Σ), but most of the time these population parameters are unknown. The classical mean, with a 0% breakdown point, is highly sensitive to outliers: even a single outlier can divert the estimate from the true location and corrupt the least-squares-based (co)variance [2,3,4,5,6]. The situation is even more complicated for multivariate data, as multivariate outliers are harder to identify.
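The 0% breakdown point of the classical mean can be seen in a minimal sketch. The toy data and the helper names `columnwise_mean` and `columnwise_median` are invented for illustration and do not come from the paper; the contaminated copy differs from the clean data in a single cell only.

```python
import statistics


def columnwise_mean(rows):
    """Classical mean of each variable (column)."""
    return [statistics.mean(col) for col in zip(*rows)]


def columnwise_median(rows):
    """Coordinate-wise median of each variable (column)."""
    return [statistics.median(col) for col in zip(*rows)]


# Clean toy data: five observations on two variables.
clean = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0), (5.0, 6.0)]

# Same data with one contaminated cell in the last observation.
contaminated = clean[:-1] + [(1000.0, 6.0)]

clean_mean = columnwise_mean(clean)
contaminated_mean = columnwise_mean(contaminated)
contaminated_median = columnwise_median(contaminated)

print(clean_mean)           # [3.0, 4.0]
print(contaminated_mean)    # [202.0, 4.0]
print(contaminated_median)  # [3.0, 4.0]
```

A single bad cell moves the mean of the first variable from 3.0 to 202.0, while the coordinate-wise median is unchanged. This only illustrates the sensitivity described above; it is not the estimator proposed in the paper.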