Co-Inference of Data Mislabelings Reveals Improved Models in Genomics and Breast Cancer Diagnostics

Susanne Gerber, Charlotte Hewel, Ali Torkamani, Stanislav Sys, Illia Horenko, Lukas Pospisil

doi:10.3389/frai.2021.739432

Abstract

Mislabeling of cases as well as controls in case–control studies is a frequent source of strong bias in prognostic and diagnostic tests and algorithms. Common data processing methods available to researchers in the biomedical community do not allow for consistent and robust treatment of labeled data in situations where both the case and the control groups contain a non-negligible proportion of mislabeled data instances. This is an especially prominent issue in studies of late-onset conditions, where individuals who may later convert to cases may populate the control group, and in screening studies, which often have high false-positive and false-negative rates. To address this problem, we propose a method for the simultaneous robust inference of Lasso-reduced discriminative models and of latent group-specific mislabeling risks that does not require any exactly labeled data. We apply it to a standard breast cancer imaging dataset and infer the mislabeling probabilities (the rates of false-negative and false-positive core-needle biopsies) together with a small set of simple diagnostic rules, outperforming the state-of-the-art BI-RADS diagnostics on these data. The inferred mislabeling rates for breast cancer biopsies agree with published purely empirical studies. Applying the method to human genomic data from a healthy-ageing cohort reveals a previously unreported compact combination of single-nucleotide polymorphisms that is strongly associated with a healthy-ageing phenotype for Caucasians. The method further determines that 7.5% of Caucasians in the 1000 Genomes dataset (selected as a control group) carry a pattern characteristic of healthy ageing.
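To make the idea of jointly inferring a classifier and mislabeling risks more concrete, here is a minimal sketch. It is not the authors' algorithm. It assumes a plain logistic model for the true label and a simple label-flipping model with two hypothetical parameters: `fp_rate` (probability that a true control is recorded as a case) and `fn_rate` (probability that a true case is recorded as a control). All function names, the toy data, and the L1 penalty weight `lam` are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def prob_observed_case(x, weights, bias, fp_rate, fn_rate):
    """P(recorded label = case | x) under an assumed label-flipping model."""
    p_true = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
    # A true case is recorded as a case with probability 1 - fn_rate;
    # a true control is recorded as a case with probability fp_rate.
    return (1.0 - fn_rate) * p_true + fp_rate * (1.0 - p_true)

def penalized_nll(X, y, weights, bias, fp_rate, fn_rate, lam):
    """Mean negative log-likelihood of the recorded labels plus an L1 penalty."""
    eps = 1e-12  # guards against log(0)
    nll = 0.0
    for x, label in zip(X, y):
        q = prob_observed_case(x, weights, bias, fp_rate, fn_rate)
        q = min(max(q, eps), 1.0 - eps)
        nll -= math.log(q) if label == 1 else math.log(1.0 - q)
    return nll / len(X) + lam * sum(abs(w) for w in weights)

# Toy data (hypothetical): the last sample looks like a case but is recorded as a control.
X = [[-3.0], [-2.0], [2.0], [3.0], [3.0]]
y = [0, 0, 1, 1, 0]
weights, bias, lam = [2.0], 0.0, 0.01

clean = penalized_nll(X, y, weights, bias, fp_rate=0.0, fn_rate=0.0, lam=lam)
noisy = penalized_nll(X, y, weights, bias, fp_rate=0.0, fn_rate=0.2, lam=lam)
print(clean, noisy)  # allowing a nonzero false-negative rate lowers the objective here
```

In a full procedure, the weights and both rates would be optimized together rather than fixed by hand as above; the sketch only shows how the mislabeling rates enter the likelihood alongside the sparsity penalty.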

Highlights

The analysis of biomedical data often aims to identify a specific set of characteristics or biomarkers that will allow for the most accurate and efficient discrimination between groups.
The area under the curve (AUC) values obtained with the mislabeling model (1–4) introduced in the article are statistically significantly higher than the AUC values of the BI-RADS and ANN strategies published in the literature. These results indicate that the classification model with mislabeling would achieve statistically significantly higher AUC values than the common BI-RADS diagnostics.
The raw datasets used in many types of biomedical data analysis are very large. Making sense of them requires statistical methods able to handle problems where the dimensionality n of the feature space is typically orders of magnitude larger than the number T of individuals in the groups.
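Since the highlights compare methods by their AUC values, the following sketch shows one standard way to compute an AUC: as the fraction of (case, control) pairs in which the case receives the higher score, with ties counted as one half. The function name `auc` and the example scores are assumptions made for illustration, and the sketch does not reproduce the significance testing reported in the article.

```python
def auc(scores, labels):
    """AUC as the probability that a random case outscores a random control."""
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0   # case ranked above control
            elif c == k:
                wins += 0.5   # ties count as half a win
    return wins / (len(cases) * len(controls))

# Hypothetical classifier scores and recorded labels (1 = case, 0 = control).
example_scores = [0.1, 0.4, 0.35, 0.8]
example_labels = [0, 0, 1, 1]
print(auc(example_scores, example_labels))  # 3 of 4 pairs ordered correctly -> 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine for a small illustration; a rank-based formulation would be preferable for large cohorts.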

Summary

Introduction

The analysis of biomedical data often aims to identify a specific (small) set of characteristics or biomarkers that will allow for the most accurate and efficient discrimination between groups. Erroneous assignments occur in many studies for a variety of reasons: experimental errors, differences in the platforms used to acquire data from different groups, differences in the protocols used to post-process data, or intrinsic difficulties in distinguishing the case and control groups (Lam et al., 2011; O'Rawe et al., 2013; Ross et al., 2013; Weißbach et al., 2021). This is evident in the case of coronavirus data emerging from different sources, where it has been shown that data variability is an important factor in the usability of such data for machine learning (Sáez et al., 2020). Assigning individuals who actually belong to the case group to the control group in a (bio)medical study can introduce a strong source of error and can have severe consequences for the patients if the mislabeling is not identified and correct medical treatment is withheld.

Objectives

Methods

Results

Conclusion

Journal: Frontiers in Artificial Intelligence	Publication Date: Jan 5, 2022
Citations: 2	License type: CC BY 4.0

Similar Papers

Abstract P3-02-01: The aggregate number of false-positive recalls and biopsies performed under different breast cancer screening strategies in the US
Carlie K Thompson ... Cristina O'Donoghue
Cancer Research | VOL. 75
30 Apr 2015

Abstract PD2-15: Effect of mammography screening frequency on false-positive biopsy rates and detection of local recurrence among breast cancer survivors
S Yuan ... R Ha
Cancer Research | VOL. 78
14 Feb 2018

Effect of mammography screening frequency on false-positive biopsy rates and detection of local recurrence among breast cancer survivors.
Julia E Mcguinness ... Sarah Yuan
Journal of Clinical Oncology | VOL. 38
20 May 2020

Abstract 3477: 3D morphological hallmarks of breast carcinogenesis: Diagnosis of non-invasive and invasive breast cancer with Lightsheet microscopy
Octavian Bucur ... Eun-Yeong Oh
Cancer Research | VOL. 75
01 Aug 2015
