Abstract

Algorithmic decision making is becoming more prevalent, increasingly impacting people’s daily lives. Recently, discussions have been emerging about the fairness of decisions made by machines, and researchers have proposed different approaches for improving the fairness of these algorithms. While these approaches can help machines make fairer decisions, they have been developed and validated on fairly clean data sets. Unfortunately, most real-world data have complexities that make them dirtier. This work considers two of these complexities, analyzing the impact of two real-world data issues on fairness for categorical data: missing values and selection bias. After formulating this problem and showing its existence, we propose fixing algorithms for data sets containing missing values and/or selection bias; these algorithms use different forms of reweighting and resampling based upon the missing value generation process. We conduct an extensive empirical evaluation on both real-world and synthetic data using various fairness metrics, and demonstrate how missing values generated by different mechanisms, as well as selection bias, affect prediction fairness even when prediction accuracy remains fairly constant.
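The abstract refers to the missing value generation process; the three standard mechanisms (MCAR, MAR, and MNAR, listed in the highlights below) can be illustrated concretely. The following Python sketch is a rough illustration under assumed toy data, not the paper's data or code: the column names, probabilities, and sample size are assumptions chosen for clarity.

```python
# Illustrative sketch (not the paper's code): inject the three standard
# missingness mechanisms into a toy categorical data set.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

df = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=n),
    "edu": rng.choice(["low", "mid", "high"], size=n, p=[0.3, 0.5, 0.2]),
    "label": rng.choice([0, 1], size=n, p=[0.6, 0.4]),
})

def with_missing(values, mask):
    """Return a copy of `values` with entries under `mask` replaced by NaN."""
    out = values.astype(object)
    out[mask] = np.nan
    return out

edu = df["edu"].to_numpy()

# MCAR: missingness is independent of all data (fixed probability).
df["edu_mcar"] = with_missing(edu, rng.random(n) < 0.2)

# MAR: missingness depends only on observed attributes (here, `sex`).
df["edu_mar"] = with_missing(edu, rng.random(n) < np.where(df["sex"] == "F", 0.35, 0.05))

# MNAR: missingness depends on the unobserved value itself
# (e.g., "low" education is under-reported).
df["edu_mnar"] = with_missing(edu, rng.random(n) < np.where(edu == "low", 0.5, 0.05))

# The mechanism determines what the missingness correlates with,
# which is why it matters for downstream fairness.
for col in ["edu_mcar", "edu_mar", "edu_mnar"]:
    print(col,
          "by sex:", df.groupby("sex")[col].apply(lambda s: s.isna().mean()).round(2).to_dict(),
          "by true edu:", df.groupby("edu")[col].apply(lambda s: s.isna().mean()).round(2).to_dict())
```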

Highlights

  • In today’s big data world, algorithmic decision making is becoming more pervasive in areas that impact our everyday lives, including hiring, credit approval, and criminal justice

  • We begin by expanding the framing of fairness to consider three missing value mechanisms, missing at random (MAR), missing not at random (MNAR), and missing completely at random (MCAR)

  • We propose fixing algorithms to mitigate the negative effects resulting from missing values and selection bias
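The fixing algorithms themselves are described in the paper and are based on reweighting and resampling. As a generic, hedged illustration of the reweighting idea only (not the paper's specific algorithms), the sketch below applies inverse probability weighting to the complete cases of a toy data set with MAR missingness; the data, probabilities, and the particular weighting rule are assumptions for illustration.

```python
# Generic illustration of a reweighting-style fix (inverse probability
# weighting of complete cases); NOT the paper's specific fixing algorithms.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20_000

sex = rng.choice(["F", "M"], size=n)
label = rng.binomial(1, np.where(sex == "F", 0.6, 0.4))
edu = rng.choice(["low", "mid", "high"], size=n)

# MAR missingness: `edu` is missing far more often for one group.
p_missing = np.where(sex == "F", 0.4, 0.05)
edu_obs = np.where(rng.random(n) < p_missing, None, edu)

df = pd.DataFrame({"sex": sex, "edu": edu_obs, "label": label})
observed = df["edu"].notna()

# Dropping incomplete rows under-represents the group with more missingness,
# which skews any statistic that depends on group membership.
print("true share of F:         ", round((df["sex"] == "F").mean(), 3))
print("complete-case share of F:", round((df.loc[observed, "sex"] == "F").mean(), 3))

# Reweighting fix: weight each complete case by 1 / P(observed | sex),
# estimated from the observed missingness rate within each group.
p_obs = observed.groupby(df["sex"]).transform("mean")
w = 1.0 / p_obs[observed]

print("true overall label rate:     ", round(df["label"].mean(), 3))
print("complete-case label rate:    ", round(df.loc[observed, "label"].mean(), 3))
print("reweighted label rate:       ", round(np.average(df.loc[observed, "label"], weights=w), 3))
```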


Introduction

In today’s big data world, algorithmic decision making is becoming more pervasive in areas that impact our everyday lives, including hiring, credit approval, and criminal justice. As more applications use algorithmic decision making, there are growing concerns about their transparency, accountability, and fairness [7,16,49]. In the USA, the Civil Rights Act of 1964 prohibits discrimination against people based on race, color, religion, sex, or national origin. These demographic traits are examples of sensitive/protected attributes, that is, attributes that should not be dominant features used by machine learning algorithms to make predictions. Sensitive attributes are identified based on the task being conducted and the established legal framework.
