Abstract

Pearson’s chi-squared test can detect outliers in the data distribution of a given set of histograms. However, in fields such as demographics (e.g. birth years), outliers may be found more easily in terms of histogram smoothness, where techniques such as Whipple’s or Myers’ indices successfully handle only specific anomalies. This paper proposes detecting smoothness outliers among histograms by using the relation between their discrete total variations (DTV) and their respective sample sizes. This relation is derived mathematically so that it is applicable in all cases, and it is then simplified by an accurate linear model. The deviation of a histogram’s DTV from the value predicted by the model is used as the outlier score, and the proposed method is named Total Variation Outlier Recognizer (TVOR). TVOR requires no prior assumptions about the distribution of the histograms’ samples, has no hyperparameters that require tuning, is not limited to specific patterns, and is applicable to histograms with the same bins. Each bin can have an arbitrary interval that can also be unbounded. TVOR finds DTV outliers more easily than Pearson’s chi-squared test, whereas the opposite holds for distribution outliers. TVOR is tested on real census data, where it successfully finds suspicious histograms. The source code is given at https://github.com/DiscreteTotalVariation/TVOR .

Highlights

  • Outliers can be defined as data patterns that do not conform to an expected normal data behavior [1]

  • The first experiments that were carried out consisted of taking many variously sized subsamples of the birth years from the German census of 1939, calculating the discrete total variations of their birth year histograms, and fitting the proposed method’s model in Eq. (49) to the data obtained in this way

  • In this paper, a method for finding discrete total variation outliers among histograms has been proposed. It scores histograms based on the deviation of their discrete total variation from its expected value
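
The scoring idea described above can be illustrated with a minimal sketch. It takes the DTV of a histogram to be the sum of absolute differences between adjacent bin counts, which is the usual discrete form and is assumed here rather than quoted from the paper. The trend a*n + b*sqrt(n) fitted against the sample size n is an illustrative stand-in chosen for this sketch; it is not the model of Eq. (49), which is not reproduced in this summary. The function names are hypothetical and do not come from the TVOR repository.

```python
import numpy as np


def discrete_total_variation(hist):
    """Sum of absolute differences between adjacent bin counts."""
    counts = np.asarray(hist, dtype=float)
    return float(np.abs(np.diff(counts)).sum())


def dtv_outlier_scores(histograms):
    """Return the DTV minus a fitted trend value for each histogram.

    ASSUMPTION: the trend a*n + b*sqrt(n) is an illustrative stand-in,
    not the model derived in the paper.
    """
    hists = [np.asarray(h, dtype=float) for h in histograms]
    # Sample size of each histogram is the sum of its bin counts.
    n = np.array([h.sum() for h in hists])
    dtv = np.array([discrete_total_variation(h) for h in hists])
    # Least-squares fit of DTV against the assumed features of n.
    features = np.column_stack([n, np.sqrt(n)])
    coef, *_ = np.linalg.lstsq(features, dtv, rcond=None)
    # Positive score: the histogram is rougher than the trend predicts.
    return dtv - features @ coef
```

All histograms passed to `dtv_outlier_scores` are expected to share the same bins. Under these assumptions, the histogram with the largest score is the one whose roughness exceeds the fitted trend the most, which makes it a candidate for closer inspection.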


Summary

Introduction

Outliers can be defined as data patterns that do not conform to an expected normal data behavior [1]. Since identifying outliers or anomalies can often be useful, outlier (i.e. anomaly) detection plays an important role in many data-related areas. With the ever-growing application of machine learning in various fields, having clean training sets that are free of unwanted outliers can often significantly benefit the final production accuracy. In real-time applications such as network traffic or health monitoring, it is usually important to detect anomalies that could represent any form of unwanted behavior in order to prevent their potentially detrimental effects. It may also be required to see which samples differ the most from the rest of the data and to study them in more detail. Since there is a relatively high demand for anomaly and outlier detection methods in fields dealing with some form

