A Classification Algorithm Utilizing the Lempel-Ziv Complexity Score for Missing Data

Valerie Sessions, Justin Grieves, Stanley Perrine

doi:10.1007/978-3-031-35308-6_1

Abstract

Informative data analysis relies heavily on the quality of the underlying data. Unfortunately, the data to be analyzed in our research often contain many missing values. While methods exist to mitigate missing data (listwise deletion, multiple imputation, etc.), these methods are only appropriate when data are missing at random. When data are missing not at random, using these methods leads to erroneous analyses. Determining whether a data set contains random or non-random missing data is an open challenge in our field. The authors propose an algorithm that categorizes missing data using the Lempel-Ziv (LZ) complexity score and analyze initial results from its use on both generated and publicly available data. The algorithm has several positive features: it is useful with data sets of all compositions (string, numerical, graphics, mixed), yields easily interpreted results, and can be used autonomously to determine the type of missingness (random versus non-random). The authors review related literature, explain the algorithm, and interpret initial results of its use with data from canonical Bayesian networks, United States census data, and data sets from the University of California, Irvine machine learning repository. Further uses in the field of bioinformatics and pathways for future research are discussed.
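To make the central quantity concrete, the following is a minimal sketch of an LZ complexity score computed over a missingness pattern. It is not the authors' algorithm, whose details appear in the full text. It assumes the 1976 Lempel-Ziv phrase-counting definition and assumes that missingness is encoded as a 0/1 string; the function names `missingness_mask` and `lz76_complexity` are hypothetical.

```python
import math


def missingness_mask(values):
    """Encode a column as a 0/1 string: '1' where the value is missing.

    Assumption for illustration: None and float NaN count as missing.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    return "".join("1" if is_missing(v) else "0" for v in values)


def lz76_complexity(s):
    """Count the phrases in the Lempel-Ziv (1976) parsing of s.

    Each phrase is the shortest substring starting at position i that
    cannot be copied from earlier text (overlapping copies are allowed).
    This is a simple, unoptimized implementation.
    """
    n = len(s)
    i = 0
    phrases = 0
    while i < n:
        length = 1
        # Extend the candidate while it still occurs in the text
        # that precedes its own last character.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


# A strictly alternating (highly structured) missingness pattern.
periodic = missingness_mask([None if k % 2 else 1.0 for k in range(16)])
print(periodic, lz76_complexity(periodic))
```

Under these assumptions, a highly regular missingness pattern parses into few phrases, while an irregular one parses into more. How a score is mapped to a random versus non-random decision is left to the algorithm described in the full text.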

Full Text