Background/Objectives: Thyroid nodules are a very common finding; most are benign, but some are malignant, so accurate diagnosis is required. Ultrasound and fine-needle biopsy are the most widely used and reliable diagnostic methods to date, but they are sometimes limited in distinguishing benign from malignant nodules; for ultrasound in particular, the main limitation is its dependence on the operator’s experience. Radiomics (the extraction of quantitative features from medical images) and machine learning offer promising avenues to improve diagnosis. The aim of this work was to develop a machine learning model based on thyroid ultrasound images to classify nodules as benign or malignant. Methods: Ultrasound images from 142 subjects were collected. Of these subjects, 40 (28.2%) belonged to the “malignant” class and 102 (71.8%) to the “benign” class, according to histological diagnosis from fine-needle aspiration. This image set was used for the training, cross-validation and internal testing of three different machine learning models. A robust radiomic approach was applied, under the hypothesis that radiomic features could capture the disease heterogeneity between the two groups. Three models, each consisting of four ensembles of one classifier type (random forests, support vector machines or k-nearest neighbor classifiers), were developed for this binary classification task. The best-performing model was then externally tested on a cohort of 21 new patients.
Results: In the internal test cohort, the best model (ensemble of random forests) achieved the following performance, reported as majority-vote value, mean value and bracketed interval (* p < 0.05, ** p < 0.005): receiver operating characteristic area under the curve (ROC-AUC) of 85%, 83.7% ** [80.2–87.2]; accuracy of 83%, 81.2% ** [77.1–85.2]; sensitivity of 70%, 67.5% ** [64.3–70.7]; specificity of 88%, 86.5% ** [82–91]; positive predictive value (PPV) of 70%, 66.5% ** [57.9–75.1]; and negative predictive value (NPV) of 88%, 87.1% ** [85.5–88.8]. In the external testing cohort, it achieved an accuracy of 90.5%, a sensitivity of 100%, a specificity of 86.7%, a PPV of 75% and an NPV of 100%. Conclusions: The model consisting of four ensembles of random forest classifiers identified all the malignant nodules and the large majority of benign nodules in the external testing cohort.
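As a worked check of the external-cohort figures, the sketch below recomputes the five metrics from a confusion matrix. The counts (6 true positives, 2 false positives, 13 true negatives, 0 false negatives) are not reported in the abstract; they are an assumption, inferred as the combination of 21 patients that reproduces the stated percentages. The function name `binary_metrics` is likewise hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return standard binary classification metrics as percentages."""
    return {
        "accuracy": 100 * (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": 100 * tp / (tp + fn),   # true positive rate
        "specificity": 100 * tn / (tn + fp),   # true negative rate
        "ppv": 100 * tp / (tp + fp),           # positive predictive value
        "npv": 100 * tn / (tn + fn),           # negative predictive value
    }

# Assumed counts for the 21-patient external cohort (inferred, not reported).
external = binary_metrics(tp=6, fp=2, tn=13, fn=0)

for name, value in external.items():
    print(f"{name}: {value:.1f}%")
```

Under these assumed counts, the cohort would contain 6 malignant and 15 benign nodules, with 2 benign nodules misclassified as malignant, which matches a sensitivity of 100% alongside a PPV of 75%.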