Abstract

Biases in data used to train machine learning (ML) models can inflate their prediction performance and confound our understanding of how and what they learn. Although biases are common in biological data, systematic auditing of ML models to identify and eliminate these biases is not a common practice when applying ML in the life sciences. Here we devise a systematic, principled, and general approach to audit ML models in the life sciences. We use this auditing framework to examine biases in three ML applications of therapeutic interest and identify unrecognized biases that hinder the ML process and result in substantially reduced model performance on new datasets. Ultimately, we show that ML models tend to learn primarily from data biases when there is insufficient signal in the data to learn from. We provide detailed protocols, guidelines, and examples of code to enable tailoring of the auditing framework to other biomedical applications.

Highlights

  • Biases in data used to train machine learning (ML) models can inflate their prediction performance and confound our understanding of how and what they learn

  • We devised a systematic auditing framework for paired-input biological ML applications, a class of ML prediction methods widely harnessed in computational biology[2] in which the goal is to predict the biological relationship between two entities

  • One such application is protein–protein interaction (PPI) prediction, where there is great interest in developing classifiers that learn from previously characterized interactions to infer whether a given protein pair is likely to interact based on the features of the two proteins (see the sketch below)
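To make the paired-input setup concrete, below is a minimal sketch in Python. It is an illustrative assumption, not the paper's actual pipeline: the `protein_features` dictionary, the `pair_features` helper, and the random toy labels are hypothetical stand-ins, and a scikit-learn random forest stands in for whatever classifier a given study uses.

```python
# Minimal sketch of a paired-input PPI classifier (hypothetical setup, not the
# paper's pipeline): each example is a protein pair, featurized by concatenating
# the two per-protein feature vectors, with a binary interact/non-interact label.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy stand-ins: 100 proteins, each with a 16-dimensional feature vector
# (in practice these might be sequence- or expression-derived features).
protein_features = {f"P{i}": rng.normal(size=16) for i in range(100)}

def pair_features(a, b):
    """Featurize a protein pair by concatenating its two feature vectors."""
    return np.concatenate([protein_features[a], protein_features[b]])

# Toy labeled pairs: (protein A, protein B, interacts?); labels are random here.
pairs = [(f"P{rng.integers(100)}", f"P{rng.integers(100)}", int(rng.random() < 0.5))
         for _ in range(500)]

X = np.array([pair_features(a, b) for a, b, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```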


Introduction

Biases in data used to train machine learning (ML) models can inflate their prediction performance and confound our understanding of how and what they learn. Biological datasets often suffer from representational biases, i.e., an imbalance or inequality in how different biological entities are represented in biological data, due to evolutionary redundancies, inherent over- or underrepresentation of biological entities (e.g., housekeeping genes in gene expression data and interaction hubs in protein–protein interaction [PPI] data), and/or biases specific to or induced by different experimental conditions. When these biases are not identified and eliminated, the ML process can be misled such that the model learns predominantly from biases unique to the training dataset and is not generalizable across different datasets. We devised a systematic auditing framework for paired-input biological ML applications, a class of ML prediction methods widely harnessed in computational biology[2] in which the goal is to predict the biological relationship between two entities. We used this framework to identify biases that have confounded the ML process in three applications of great interest to the life sciences and biotechnology communities: PPIs, drug–target bioactivity, and MHC–peptide binding[3,4,5]. We provide detailed protocols, guidelines, and examples of code to enable tailoring of the auditing framework to other biomedical applications (Supplementary Notes 1 and 2).
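As one concrete instance of such an audit, consider the sketch below, which continues the hypothetical PPI toy setup from the Highlights above. The idea, hedged as an illustration rather than the paper's exact protocol, is to compare model performance under a conventional random split of pairs against a protein-disjoint split in which no protein from any test pair appears in training; on real data, a large gap between the two suggests the model is exploiting representational biases (e.g., interaction hubs) rather than generalizable biological signal. (With the random toy labels here, both AUCs will hover near 0.5; the sketch demonstrates the protocol, not the gap.)

```python
# Hypothetical audit sketch (continuing the toy setup above): compare a random
# pair-level split with a protein-disjoint split where all test-pair proteins
# are unseen during training. A large performance gap flags bias-driven learning.
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(train_idx, test_idx):
    """Train on the given pair indices and report test-set AUC."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    return roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1])

# Audit 1: conventional random split over pairs (proteins shared across splits).
train_idx, test_idx = train_test_split(np.arange(len(pairs)), test_size=0.3,
                                       random_state=0)
auc_random = evaluate(train_idx, test_idx)

# Audit 2: protein-disjoint split -- hold out a subset of proteins entirely, so
# every test pair involves only proteins never seen in training; pairs mixing
# held-out and training proteins are dropped from both sets.
proteins = sorted({p for a, b, _ in pairs for p in (a, b)})
held_out = set(proteins[: len(proteins) // 4])
test_idx = np.array([i for i, (a, b, _) in enumerate(pairs)
                     if a in held_out and b in held_out])
train_idx = np.array([i for i, (a, b, _) in enumerate(pairs)
                      if a not in held_out and b not in held_out])
auc_disjoint = evaluate(train_idx, test_idx)

# On real data, auc_random >> auc_disjoint would suggest the model relies on
# biases tied to specific well-represented proteins, not generalizable signal.
print(f"random-split AUC:      {auc_random:.3f}")
print(f"protein-disjoint AUC:  {auc_disjoint:.3f}")
```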

