Abstract

Software designers are increasingly building tools with integrated machine learning (ML) capabilities. Even naive users can build ML-based models for their respective problem domains using these tools. However, most advanced ML models behave like black boxes, in the sense that their behavior is not easily interpretable by humans. This lack of human interpretability is a hurdle to the acceptance and subsequent deployment of ML-based solutions. This research work is based on the intuitive idea that in an ideal ML-based solution, the provided dataset, the ML model learned from this dataset, and human domain experts must agree in their perception of the problem domain. The objective of this work is to propose a framework that first listens to the dataset, the ML model, and human domain experts individually, and then measures the degree of agreement between them. Listening to the dataset refers to identifying important characteristics by computing information gain using entropy and the Gini index. Listening to the ML model refers to interpreting its decision-making behavior by computing variable importance measures. Listening to human experts refers to taking feedback on which features they consider important for the problem domain. The task of measuring agreement between the dataset, the model, and the human experts has been modeled as a "two-judges n-participants" rank correlation problem. The proposed approach has been evaluated on a novel problem domain: predicting the joining behavior of freshmen students. A positive degree of agreement was observed between the dataset, the ML model, and the human experts in their perception of the problem domain. This agreement between the provided dataset, the learned model, and the human domain experts helps verify the learning acquired by the ML model against prevailing domain knowledge. This work has the potential to form the basis for formal quantitative metrics that evaluate ML models in terms of reliable learning and their capability to earn the trust of human users.
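To make the "listening to the dataset" step concrete, the sketch below computes information gain over a categorical feature using both entropy and the Gini index, the two impurity measures named in the abstract. It is a minimal illustration, not the paper's implementation; the feature and label names (hostel requirement, home city, joining outcome) are hypothetical stand-ins for the admissions data.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini index of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(feature, labels, impurity=entropy):
    """Impurity reduction from splitting `labels` on the distinct
    values of a categorical `feature` column."""
    total = impurity(labels)
    weighted = 0.0
    for v in np.unique(feature):
        mask = feature == v
        weighted += mask.mean() * impurity(labels[mask])
    return total - weighted

# Hypothetical toy data: did the admitted student join (1) or not (0)?
joined   = np.array([1, 1, 0, 1, 0, 0, 1, 0])
f_hostel = np.array(["yes", "yes", "no", "yes", "no", "no", "no", "no"])
f_city   = np.array(["A", "B", "A", "B", "A", "B", "A", "B"])

# Rank features by gain; higher gain means more informative.
print(information_gain(f_hostel, joined, impurity=entropy))
print(information_gain(f_hostel, joined, impurity=gini))
print(information_gain(f_city, joined))   # ~0: uninformative split
```

Ranking all features by such gain scores yields the dataset's own ordering of important characteristics, which the framework can later compare against the model's and the experts' orderings.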
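For the "listening to the ML model" step, the abstract mentions variable importance measures without fixing a particular one. The sketch below, assuming a tree-ensemble classifier and synthetic stand-in data, shows two common choices: impurity-based importances and model-agnostic permutation importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the admissions dataset (the real features are
# not listed in the abstract).
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)

# Model choice is an assumption; the abstract does not pin one down.
model = RandomForestClassifier(random_state=0).fit(X, y)

# 1. Impurity-based importances built into tree ensembles.
impurity_imp = model.feature_importances_

# 2. Permutation importance: mean drop in score when one feature's
#    values are shuffled, breaking its link to the target.
perm = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for i in np.argsort(-perm.importances_mean):
    print(f"feature {i}: impurity={impurity_imp[i]:.3f}, "
          f"permutation={perm.importances_mean[i]:.3f}")
```

Sorting features by either score gives the model's ranking of the problem domain's characteristics, the second of the three "voices" the framework compares.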
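Finally, the "two-judges n-participants" rank correlation can be illustrated with Spearman's coefficient, where the n features play the participants and any two of the three sources (dataset, model, experts) play the judges. The abstract does not name the specific coefficient, so this is a sketch assuming Spearman's rho with untied ranks; the rankings shown are hypothetical.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two judges ranking the same n
    participants with no ties: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical rankings of the same five features (1 = most important).
model_rank  = [1, 2, 3, 4, 5]   # e.g., from variable importance measures
expert_rank = [2, 1, 3, 4, 5]   # e.g., elicited from domain experts

print(spearman_rho(model_rank, expert_rank))  # 0.9 -> strong agreement
```

A rho near +1 corresponds to the positive degree of agreement reported in the evaluation; a value near 0 or below would flag a mismatch between what the model learned and what the data or the experts consider important.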
