Abstract

Finding and analyzing Simpson's paradox, a well-known statistical phenomenon, has found many applications. While the existing literature focuses only on analyzing the causes of identified instances of Simpson's paradox, there is no systematic analysis of Simpson's paradox in multidimensional spaces. In this paper, we develop a simple yet practical approach to automatically identify all Simpson's paradox instances formed by various sub-populations and separator attributes in a multidimensional data set. Moreover, we analyze the distribution of the multidimensional Simpson's paradox instances on three real data sets with respect to dimensionality, size of sub-populations, participation of individual records, redundancy, and more. We obtain a series of interesting observations about a few questions that have never been asked before. The results open doors to a few interesting directions for future study. This paper is also the outcome of a high-school student summer research internship. It reflects our ongoing effort in promoting data science research to youth and high school students.
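
To make the notion of a Simpson's paradox instance concrete, the following is a minimal sketch of a single check, not the identification approach developed in the paper. It assumes a binary outcome, exactly two groups being compared, and one separator attribute, and it uses the strict textbook definition in which the direction of the comparison reverses in every separator value. The function name `is_simpson_instance`, the input layout, and the widely cited kidney-stone treatment counts are illustrative assumptions and are not taken from the paper or its data sets.

```python
def is_simpson_instance(counts):
    """Check one candidate instance of Simpson's paradox.

    `counts` maps (group, separator_value) -> (successes, trials).
    Assumptions (hypothetical layout, not from the paper): exactly two
    groups, every (group, separator_value) cell present, trials > 0.
    """
    groups = sorted({g for g, _ in counts})
    if len(groups) != 2:
        raise ValueError("expected exactly two groups to compare")
    a, b = groups
    separators = sorted({s for _, s in counts})

    def rate(cells):
        # Pooled success rate over a list of (successes, trials) cells.
        return sum(s for s, _ in cells) / sum(t for _, t in cells)

    def sign(x):
        return (x > 0) - (x < 0)

    # Direction of the comparison on the aggregated data.
    overall = sign(rate([counts[(a, s)] for s in separators])
                   - rate([counts[(b, s)] for s in separators]))

    # Direction of the comparison within each separator value.
    partial = {sign(rate([counts[(a, s)]]) - rate([counts[(b, s)]]))
               for s in separators}

    # Paradox: a non-zero aggregate direction that is reversed in every part.
    return overall != 0 and partial == {-overall}


# Textbook kidney-stone illustration: treatment -> stone size -> (successes, trials).
KIDNEY_STONES = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

result = is_simpson_instance(KIDNEY_STONES)
```

Here treatment A has the higher success rate within both stone sizes, yet treatment B has the higher rate once the sizes are pooled, so the check reports an instance. In a multidimensional setting, a check of this kind would presumably be repeated over candidate sub-populations and separator attributes; how the paper enumerates those candidates is not shown here.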

