Abstract

In data analysis, the treatment of outliers in quantitative variables is critical because it affects the quality of the conclusions. Although very good tools exist for detecting outliers, dealing with them is not always straightforward. Statisticians recommend modeling the process underlying the outliers in order to identify the best way to handle them. In Data Science and Machine Learning, identifying the processes that generate outliers remains problematic because it requires a human to visually interpret certain statistical tools. The techniques proposed so far are systematic imputations by a measure of central tendency, usually the arithmetic mean or the median. Although suited to Data Science and Machine Learning workflows, these approaches share a fundamental problem: they modify the distribution of the original data. The purpose of this paper is to propose an algorithm that lets software process outliers automatically while preserving the distributional structure of the treated variable, whatever its probability distribution. The method is based on the box plot (box-and-whisker plot) developed by John Tukey. The procedure is tested on existing real data. All processing is performed in the R programming language.
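To make the problem stated above concrete, the following is a minimal sketch of Tukey's detection rule followed by median imputation. It is not the algorithm proposed in the paper, and it is written in Python rather than R. The function names, the sample data, the conventional multiplier `k = 1.5`, and the "inclusive" quartile method are all assumptions made for illustration.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Quartile definitions vary between tools; "inclusive" is one common choice.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it lies outside the fences."""
    low, high = tukey_fences(values, k)
    return [v < low or v > high for v in values]


def impute_with_median(values, k=1.5):
    """Replace every flagged value with the median of the full sample."""
    flags = flag_outliers(values, k)
    med = statistics.median(values)
    return [med if is_out else v for v, is_out in zip(values, flags)]


# Hypothetical sample with one extreme value.
sample = [10, 12, 11, 13, 12, 11, 14, 13, 12, 95]
imputed = impute_with_median(sample)

# Imputation changes the spread of the variable, not only the flagged value.
print("stdev before:", statistics.stdev(sample))
print("stdev after: ", statistics.stdev(imputed))
```

Comparing the two standard deviations shows how replacing flagged values with a central value alters the shape of the variable, which is the side effect the paper sets out to avoid.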

