Abstract

Algorithmic failure in applications that leads to unfairness or bias typically results from data inconsistency, diversity bias, inclusion bias, and under-representation, among other causes, which produce imbalances in representation between training and validation sets. Cross-validation (CV) techniques help address this inconsistency between the training and validation stages. The essence of CV is therefore to assess an algorithm's ability to predict new data that were not part of the training set and to expose issues such as overfitting and selection bias. This study links Inclusion, Participation, and Reciprocity (IPR) to data splitting over a sensitive attribute, ensuring representative population grouping in each split. The study modified the Pre-In-Post (P-I-P) processing approach to accommodate multiple levels of a sensitive attribute within any training set, and evaluated the result in simulation experiments and real-life applications. It then conducted a comparative performance analysis against the two most notable CV techniques. The study's innovative approach (CfCV) outperformed the existing techniques, V-fold and HoldOut, in the experiments [RMSE: CfCV = 0.88, V-fold = 0.98, HoldOut = 4.69; accuracy: CfCV = 99.75%, V-fold = 99.50%, HoldOut = 85.50%] and the applications [RMSE: CfCV = 0.59, V-fold = 1.61, HoldOut = 1.96; accuracy: CfCV = 84%, V-fold = 81%, HoldOut = 52%]. This study recommends the adoption of IPR in data splitting for machine learning experiments built for human-machine intelligence systems and concludes that such experiments would be fairer if the concept of IPR formed the foundation of the human-machine intelligence framework.
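The CfCV procedure itself is defined in the full text; as a minimal sketch of the representativeness idea the abstract describes (keeping each sensitive-attribute group's share of the data roughly constant from training to validation), the example below stratifies cross-validation folds on a sensitive attribute using scikit-learn's StratifiedKFold. The synthetic data, the `sensitive` variable, and the classifier are illustrative assumptions, not the study's design.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Illustrative data: X are features, y is the target, and `sensitive` is a
# hypothetical binary group attribute whose proportions every fold should
# preserve (stand-in for the sensitive attribute in the abstract).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
sensitive = rng.integers(0, 2, size=len(y))

# Stratifying the folds on the sensitive attribute -- rather than on the
# target -- keeps each group representatively present in both the training
# and the validation portion of every split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in skf.split(X, sensitive):
    model = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))

print(f"mean CV accuracy: {np.mean(scores):.3f}")
```

Stratifying on the sensitive attribute instead of the target is the key design choice in this sketch: it is what prevents any population group from being under-represented in a validation fold, which is the imbalance the abstract attributes to conventional splitting.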
