Abstract
We currently observe a disconcerting phenomenon in machine learning studies in psychiatry: while we would expect larger samples to yield better results due to the availability of more data, larger machine learning studies consistently show much weaker performance than the numerous small-scale studies. Here, we systematically investigated this effect, focusing on one of the most heavily studied questions in the field, namely the classification of patients suffering from Major Depressive Disorder (MDD) and healthy controls based on neuroimaging data. Drawing upon structural MRI data from a balanced sample of N = 1868 MDD patients and healthy controls from our recent international Predictive Analytics Competition (PAC), we first trained and tested a classification model on the full dataset, which yielded an accuracy of 61%. Next, we mimicked the process by which researchers would draw samples of various sizes (N = 4 to N = 150) from the population and showed a strong risk of misestimation. Specifically, for small sample sizes (N = 20), we observed accuracies of up to 95%; for medium sample sizes (N = 100), accuracies of up to 75% were found. Importantly, further investigation showed that sufficiently large test sets effectively protect against performance misestimation, whereas larger datasets per se do not. While these results question the validity of a substantial part of the current literature, we outline the relatively low-cost remedy of larger test sets, which is readily available in most cases.
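The subsampling experiment described above can be illustrated with a short simulation: repeatedly draw small samples from a large pooled dataset, split them into training and test sets, and record the resulting test accuracies. The snippet below is a minimal sketch of this idea using scikit-learn on synthetic data; the data generation, the linear SVM, the split proportion, and the number of repetitions are illustrative assumptions, not the study's actual pipeline.

```python
# Minimal sketch (not the authors' pipeline): draw small samples from a large
# pooled "population", fit an SVM, and record test accuracies to see how
# strongly small samples can misestimate the true, modest performance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the pooled dataset (balanced classes, weak signal);
# the size mirrors the PAC sample, everything else is purely illustrative.
X, y = make_classification(n_samples=1868, n_features=100, n_informative=10,
                           class_sep=0.5, random_state=0)

def sampled_accuracy(n, test_fraction=0.2):
    """Draw n subjects from the pool, split into train/test, return test accuracy."""
    idx = rng.choice(len(y), size=n, replace=False)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[idx], y[idx], test_size=test_fraction, stratify=y[idx])
    clf = LinearSVC(C=1.0, max_iter=10000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# Small samples can produce wildly optimistic accuracy estimates by chance;
# larger samples concentrate around the population-level accuracy.
for n in (20, 100, 1000):
    accs = [sampled_accuracy(n) for _ in range(200)]
    print(f"N={n:4d}: mean={np.mean(accs):.2f}  max={np.max(accs):.2f}")
```

Running such a simulation makes the reported pattern tangible: the spread (and hence the maximum) of the accuracy estimates shrinks mainly with the size of the test set, not with the size of the dataset as a whole.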
Highlights
In psychiatry, we are witnessing an explosion of interest in machine learning (ML) and artificial intelligence for prediction and biomarker discovery, paralleling similar developments in personalized medicine [1,2,3,4].
As the regularization of the support-vector machine (SVM) is sensitive to the total number of outliers, which may increase in parallel with sample size, we conducted an additional analysis with adjusted C parameters; the observed effect remained constant across these analyses (a minimal sketch of such an adjustment follows below).
Sparked by the observation that machine learning studies drawing on larger neuroimaging samples consistently showed weaker results than studies drawing on smaller ones, we drew samples of various sizes from the Predictive Analytics Competition (PAC) dataset, thereby mimicking the process by which researchers would draw samples from the population of ML studies reported in the literature.
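To make the C-parameter check mentioned in the second highlight concrete, here is a hypothetical sketch of a sample-size-dependent adjustment: scaling C inversely with the number of training samples keeps the total weight of the hinge-loss term roughly comparable as N grows. The study only states that C was adjusted; the specific scaling rule below is an assumption for illustration.

```python
from sklearn.svm import LinearSVC

def fit_svm_with_scaled_c(X_train, y_train, base_c=1.0):
    """Fit a linear SVM whose C is rescaled by the training-set size.

    Assumption: C ~ base_c / n_train keeps the summed hinge-loss contribution
    comparable across sample sizes; the paper only reports that C was adjusted,
    not this exact rule.
    """
    clf = LinearSVC(C=base_c / len(y_train), max_iter=10000)
    return clf.fit(X_train, y_train)
```

In practice, one would re-run the subsampling experiment sketched under the Abstract with this adjusted model and check whether the misestimation pattern persists, which is what the additional analysis above reports.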
Summary
We are witnessing an explosion of interest in machine learning (ML) and artificial intelligence for prediction and biomarker discovery, paralleling similar developments in personalized medicine [1,2,3,4]. In contrast to the majority of investigations employing classic group-level statistical inference, ML approaches aim to build models which allow for individual (i.e., single-subject) predictions, enabling direct assessment of individual differences and clinical utility [5]. While this constitutes a major advancement for clinical translation, recent results of large-scale investigations have given rise to a fundamental concern in the field: machine learning studies including larger samples did not yield stronger performance, but consistently showed weaker results than studies drawing on small samples, calling into question the validity and generalizability of a large number of widely published proof-of-concept studies. This stands in strong contrast to the numerous smaller studies showing accuracies of 80% or more [6,7,8].