Abstract
Rapid developments in the field of data have produced huge volumes of data, and it is natural that such data contain outliers for many reasons; these values may be very small or very large compared with the rest of the normal data. The presence of outliers affects the statistical analysis of the data, so their impact must be reduced in various ways. On the other hand, outliers can also be of great benefit, for example in recognizing geological activity that precedes natural disasters such as earthquakes, forest fires, and floods. Detecting outliers is therefore important in many fields. This research aims to develop simple methods for detecting outliers in big data, since many recently developed detection methods suffer from computational complexity or are efficient only for small sample sizes. An experimental approach was used, proposing three detection methods. The first method is based on the standard deviation and was tested against the normal distribution method and the z-score method. The second method depends on the maximum and minimum values of the data, and the third method depends on the range between successive data points. The results of the second and third methods were compared with the results of Hampel's test, and the accuracy of the results was measured using a confusion matrix. The tests showed that the results of the first method conform with those of the normal distribution method and the z-score method, and that the third method outperforms Hampel's test. It was also concluded that Hampel's test suffers from a serious weakness when zero values constitute more than 50% of the data elements.
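The abstract refers to the z-score method, Hampel's test, and the weakness of Hampel's test when more than half of the values are zero. The following Python sketch shows the textbook formulations of these two reference detectors, not the paper's own implementations; the threshold k = 3 and the MAD scaling constant 1.4826 are common conventions, not values taken from this paper. It also demonstrates why Hampel's test degenerates when zeros exceed 50% of the data: the median and the median absolute deviation both become zero.

```python
# Minimal sketch of the two reference detectors mentioned in the abstract.
# These are the standard textbook formulations, not the paper's implementations.
import numpy as np

def zscore_outliers(x, k=3.0):
    """Flag points lying more than k standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > k

def hampel_outliers(x, k=3.0):
    """Hampel's test: flag points whose distance from the median exceeds
    k times the (rescaled) median absolute deviation."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))  # MAD rescaled to approximate sigma
    return np.abs(x - med) > k * mad

# Illustration of the weakness noted in the abstract: with more than 50% zeros,
# both the median and the MAD are zero, so Hampel's criterion collapses and
# every nonzero point is flagged, regardless of how far it actually lies.
data = np.array([0] * 12 + [2, 3, 4, 5, 6, 7, 8, 500], dtype=float)
print(hampel_outliers(data))  # flags every nonzero value, not only 500
print(zscore_outliers(data))  # flags only the extreme value 500
```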
Highlights
The importance of analyzing outliers is increasing with the rapid development of information technology, as data volumes have become larger and more complex. Converting these data into useful information for decision-making and analysis involves a very important issue, namely the concept of outliers [1]. The issue of outliers has been taken up by many scientists and researchers in order to study the effect of these values on the accuracy of the results expected from data analysis [2]; among the prominent scientists who dealt with the concept of outliers are Hawkins and Freeman.
The a Point-to-Standard Deviation Method (AP-SDM) achieved results identical to those of the normal distribution method and the z-score method. When the normal distribution method was tested within (μ ± σ) and (μ ± 2σ) against AP-SDM within (AP ± 1) and (AP ± 2) respectively, an exact match was obtained for sample sizes below 200 and very close results for sample sizes above 200, across different data sizes (12, 50, 100, 200, 500, 1000) as well as on the data of the Abu Gharib factory.
For all production lines per day/year at sample size 229 and all production lines per month/year at sample size 12, the results of the AP-SDM method were completely identical to those of the normal distribution method and the z-score method (a minimal sketch of this comparison follows below).
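Since the AP-SDM formula itself is not reproduced in this summary, the sketch below shows only the two reference rules it was matched against: the normal distribution interval (μ ± kσ) and the z-score rule |z| > k. The two rules are algebraically equivalent, which is consistent with the exact agreement reported above; all variable names and the synthetic sample are illustrative assumptions.

```python
# Minimal sketch of the two reference rules used for comparison with AP-SDM.
# The AP-SDM method itself is not shown, as its formula is not given here.
import numpy as np

def normal_interval_outliers(x, k=2.0):
    """Flag values falling outside the interval [mu - k*sigma, mu + k*sigma]."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return (x < mu - k * sigma) | (x > mu + k * sigma)

def zscore_outliers(x, k=2.0):
    """The same rule expressed through standardized scores: flag |z| > k."""
    x = np.asarray(x, dtype=float)
    return np.abs((x - x.mean()) / x.std()) > k

# Because |x - mu| > k*sigma is equivalent to |z| > k, both rules always flag
# the same points, mirroring the exact agreement reported for the
# (mu +/- sigma) and (mu +/- 2*sigma) comparisons.
sample = np.random.default_rng(0).normal(loc=50, scale=5, size=200)
assert np.array_equal(normal_interval_outliers(sample, 2), zscore_outliers(sample, 2))
print(sample[normal_interval_outliers(sample, 2)])  # the flagged observations
```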
Summary
The importance of analyzing outliers is increasing with the rapid development of information technology, as data volumes have become larger and more complex. Converting these data into useful information for decision-making and analysis involves a very important issue, namely the concept of outliers [1]. The issue of outliers has been taken up by many scientists and researchers in order to study the effect of these values on the accuracy of the results expected from data analysis [2]; among the prominent scientists who dealt with the concept of outliers are Hawkins and Freeman. An outlier in a particular data set may appear as one or more values; what distinguishes such a value is that it is not logical in relation to the rest of the normal data, for example it may be very large or very small compared with the mean of the data. The existence of such a value is of high importance because it has important implications in data mining, in analyzing medical and financial data, and in the field of networks, where detection of network intrusion is one of the most applied topics that have gained importance in recent years [3].