Finding Multidimensional Simpson's Paradox

Jay Xu, Jian Pei, Zicun Cong

doi:10.1145/3575637.3575645

Abstract

Finding and analyzing Simpson's paradox, a well-known statistical phenomenon, has found many applications. While the existing literature focuses only on analyzing the causes of identified Simpson's paradox, there is no systematic analysis of Simpson's paradox in multidimensional spaces. In this paper, we develop a simple yet practical approach to automatically identify all Simpson's paradox instances formed by various sub-populations and separator attributes in a multidimensional data set. Moreover, we analyze the distribution of the multidimensional Simpson's paradox instances on three real data sets with respect to dimensionality, size of sub-populations, participation of individual records, redundancy, and more. We obtain a series of interesting observations about a few questions that have never been asked before. The results open doors to a few interesting directions for future study. This paper is also an outcome of a high-school student summer research internship. It reflects our ongoing effort in promoting data science research to youth and high school students.

Full Text