Abstract

Background
Automated image segmentation methods, together with increasing computer power, allow brain alterations to be quantified in great detail in large neuroimaging datasets. The growing size of these data complicates visual quality control (QC), which remains the gold standard, so research increasingly relies on automated QC methods. Here, we aimed to investigate the performance of automated FreeSurfer segmentation through visual QC in a large memory clinic cohort of more than ten thousand images, and to compare an automated QC measure to the visual QC.

Method
10,400 T1-weighted MRI scans from the Amsterdam Dementia Cohort were segmented with FreeSurfer (v7.1). Quality control was performed using an adapted version of the ENIGMA QC protocol [https://doi.org/10.1038/ng.2250]. Twenty-six raters visually assessed cortical segmentations, rating each as "fail", "moderate", or "pass", and identifying failure reasons and affected lobes. We investigated the occurrence of errors in failed- or moderate-quality scans, and whether segmentation failures depend on clinical diagnosis, in a subset of 4,990 baseline scans with the most common diagnoses (i.e., SCD, MCI, AD, FTD, VaD and DLB). We compared an automated QC measure (SurfaceHoles thresholded at median ± 3*IQR) with the visual QC output.

Result
The majority (78.3%) of the 10,400 cortical segmentations were rated as "pass" quality, 16.2% as "moderate" quality, and 5.4% as "fail". Concordance between automated QC and visual QC was high for pass ratings (84%) but low for fail ratings (22.5%) (Table 1). Within failed segmentations, the most common reasons were processing errors (51.7%), image artifacts (14.3%) and underestimation of cortical thickness (13.3%); for moderate ratings, the reasons were inclusion of meninges (47.8%) and underestimation of cortical thickness (33.2%) (Figure 1). Stratified per diagnosis, segmentation failed in 2.3% of SCD scans, 3% of MCI, 4.7% of DLB, 5.4% of AD, 7% of FTD and 14.5% of VaD. Further, we observed variation in affected lobes among diagnostic groups (Figure 2b).

Conclusion
The majority of scans passed visual QC, in high concordance with the automated QC. Images of moderate or failed quality occurred more often in VaD and FTD. The most common mis-segmentations were overestimation and underestimation of cortical thickness. This very large, visually quality-controlled dataset could serve as a benchmark for testing future automated QC pipelines.
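The automated QC measure described in the Method (flagging scans whose FreeSurfer SurfaceHoles count falls outside median ± 3*IQR) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the use of NumPy are assumptions, and how SurfaceHoles is extracted from FreeSurfer output is left to the reader.

```python
import numpy as np

def flag_surface_holes_outliers(surface_holes):
    """Hypothetical sketch of the abstract's automated QC measure:
    flag scans whose SurfaceHoles count (from FreeSurfer's aseg.stats)
    lies outside median +/- 3 * IQR of the cohort distribution."""
    holes = np.asarray(surface_holes, dtype=float)
    median = np.median(holes)
    q1, q3 = np.percentile(holes, [25, 75])
    iqr = q3 - q1
    lower, upper = median - 3 * iqr, median + 3 * iqr
    # True = fails automated QC (outlying hole count)
    return (holes < lower) | (holes > upper)
```

For example, in a cohort where most scans have 10-13 surface holes, a scan with 200 holes would be flagged as a likely segmentation failure, while the rest would pass.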
