Abstract

Informative data analysis relies heavily on the quality of the underlying data. Unfortunately, the data to be analyzed in our research often contain many missing values. Methods exist to mitigate missing data (listwise deletion, multiple imputation, and others), but they are appropriate only when data are missing at random. When data are missing not at random, these methods lead to erroneous analyses. Determining whether a data set contains random or non-random missing data is an open challenge in our field. The authors propose an algorithm that categorizes missing data using the Lempel-Ziv (LZ) complexity score, and they analyze initial results from its use on both generated and publicly available data. The algorithm has several positive features: it can be applied to data sets of all compositions (string, numerical, graphics, mixed), it yields easily interpreted results, and it can be used autonomously to determine the type of missingness (random versus non-random). The authors review related literature, explain the algorithm, and interpret initial results of its use with data from canonical Bayesian networks, United States census data, and data sets from the University of California, Irvine machine learning repository. Further uses in the field of bioinformatics and pathways for future research are discussed.
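The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that missingness is encoded as a binary mask ("1" for a missing entry), that complexity is measured by a simple LZ78-style phrase count, and that the observed mask is compared against random shuffles of itself. The names `lz_phrase_count`, `missingness_mask`, and `shuffle_fraction` are hypothetical.

```python
import random


def lz_phrase_count(bits: str) -> int:
    """Count phrases in an LZ78-style parse: each phrase is the shortest
    prefix of the remaining input that has not been seen as a phrase."""
    phrases = set()
    current = ""
    count = 0
    for ch in bits:
        current += ch
        if current not in phrases:
            phrases.add(current)
            count += 1
            current = ""
    # A trailing, already-seen fragment still counts as one phrase.
    if current:
        count += 1
    return count


def missingness_mask(values) -> str:
    """Encode a column as a bit string (assumption: None marks a missing entry)."""
    return "".join("1" if v is None else "0" for v in values)


def shuffle_fraction(mask: str, n_shuffles: int = 200, seed: int = 0) -> float:
    """Fraction of random shuffles of the mask whose phrase count is at or
    below the observed one. A small value suggests the missing entries are
    more structured than a random arrangement would be."""
    rng = random.Random(seed)
    observed = lz_phrase_count(mask)
    chars = list(mask)
    at_or_below = 0
    for _ in range(n_shuffles):
        rng.shuffle(chars)
        if lz_phrase_count("".join(chars)) <= observed:
            at_or_below += 1
    return at_or_below / n_shuffles


# Illustrative column: the last 50 of 200 entries are missing in one block.
column = [1.0] * 150 + [None] * 50
mask = missingness_mask(column)
fraction = shuffle_fraction(mask)
```

In this sketch a block of consecutive missing entries compresses into far fewer phrases than its shuffled counterparts, so `fraction` comes out near zero. Any threshold for labeling the missingness non-random would be a further assumption not stated in the abstract.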
