Comment on “TVOR: Finding Discrete Total Variation Outliers Among Histograms”

Melkior Ornik

doi:10.1109/access.2021.3082900

Abstract

The recent paper “TVOR: Finding Discrete Total Variation Outliers Among Histograms” introduces the Total Variation Outlier Recognizer (TVOR) method for identifying outliers among a given set of histograms. After providing a theoretical discussion of the method and verifying its success on synthetic and population census data, the paper applies the TVOR model to histograms of ages of Holocaust victims produced using United States Holocaust Memorial Museum data. It purports to identify the list of victims of the Jasenovac concentration camp as potentially suspicious. In this comment paper, we show that the TVOR model and its assumptions are grossly inapplicable to the considered dataset. When applied to these data, the model is biased toward assigning a higher outlier score to histograms of larger sizes; the set of data points is extremely sparse around the point of interest; the dataset has not been reviewed to remove obvious data processing errors; and, contrary to the model requirements, the distributions of the victims' ages naturally vary significantly across victim lists.

Highlights

Focusing on the problem of identifying compromised data, the recently published article [1] introduces a novel method named Total Variation Outlier Recognizer (TVOR) for identification of outliers across a set of histograms.
In proposing its scheme based on the difference in discrete total variations among histograms, the TVOR method critically relies on the assumption that all histograms in a dataset should come from the same probability distribution, or should at least have the same smoothness properties.
By comparing histograms obtained from 7106 historical documents such as lists of ghetto inhabitants, lists of casualties, census records, and concentration camp population lists (including the lists of victims of the Jasenovac concentration camp [3], differentiated by ethnicity), the authors of [1] claim to have detected “the potentially problematic parts of a sample, which in the case of the Jasenovac list lies in the birth years of Serbian inmates” [1, Appendix D].

Summary

INTRODUCTION

Focusing on the problem of identifying compromised data, the recently published article [1] introduces a novel method named Total Variation Outlier Recognizer (TVOR) for identification of outliers across a set of histograms. By comparing histograms obtained from 7106 historical documents such as lists of ghetto inhabitants, lists of casualties, census records, and concentration camp population lists (including the lists of victims of the Jasenovac concentration camp [3], differentiated by ethnicity), the authors of [1] claim to have detected “the potentially problematic parts of a sample, which in the case of the Jasenovac list lies in the birth years of Serbian inmates” [1, Appendix D]. In this comment paper, we show that the use of TVOR on the United States Holocaust Memorial Museum (USHMM) records, in the manner employed in [1], is inappropriate.

Notation: E[X] denotes the expected value of a random variable X on an underlying probability space.

TVOR PRELIMINARIES AND USHMM DATASET

HISTOGRAM SIZE BIAS

DEARTH OF RELEVANT HISTOGRAM DATA

DATA PROCESSING AND ANALYSIS ISSUES

Findings

CONCLUSION