Breast cancer screening is time-consuming, requires expensive equipment, and places heavy demands on physicians. As a result, many breast cancer patients worldwide miss screening and early treatment, which seriously threatens their health. Infrared spectroscopy may serve as a screening tool for breast cancer detection. In this study, Fourier transform infrared (FT-IR) spectroscopy of serum was combined with traditional machine learning algorithms to provide an auxiliary diagnosis that quickly and accurately distinguishes patients at different stages of breast cancer, including stage I disease, from control subjects without breast cancer. FT-IR spectroscopy was performed on the serum of 114 non-cancer control subjects, 35 patients with stage I, 43 patients with stage II, and 29 patients with stage III & IV breast cancer. Because the sample sizes were imbalanced, the four classes were balanced by oversampling with the Synthetic Minority Oversampling Technique (SMOTE). Separate experiments were then run with undersampling by random discarding. The average FT-IR spectra of the four groups showed differences in phospholipids, nucleic acids, lipids, and proteins between non-cancer control subjects and breast cancer patients at different stages. Based on these differences, five classification models were used to classify stage I, II, and III & IV breast cancer patients and non-cancer control subjects. First, standard normal variate (SNV) transformation was used to preprocess the original data, and then partial least squares (PLS) was used for feature extraction. Finally, five models were established: extreme learning machine (ELM), k-nearest neighbor (KNN), genetic algorithm-support vector machine (GA-SVM), particle swarm optimization-support vector machine (PSO-SVM), and grid search-support vector machine (GS-SVM).
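As an illustration of two of the steps named above, the following minimal sketch shows SNV preprocessing and undersampling by random discarding on toy data. It is not the authors' implementation: the function names, the toy absorbance values, the fixed random seed, and the use of the sample standard deviation (n - 1) in SNV are all assumptions made for this example, and the PLS and SVM stages are omitted.

```python
import random
import statistics
from collections import defaultdict


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own
    mean and standard deviation."""
    mean = statistics.fmean(spectrum)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(spectrum)
    return [(value - mean) / std for value in spectrum]


def random_undersample(spectra, labels, seed=0):
    """Randomly discard samples so that every class keeps as many samples
    as the smallest class."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    target = min(len(indices) for indices in indices_by_class.values())
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    kept = sorted(
        index
        for indices in indices_by_class.values()
        for index in rng.sample(indices, target)
    )
    return [spectra[i] for i in kept], [labels[i] for i in kept]


# Toy data with hypothetical absorbance values (three points per spectrum).
toy_spectra = [
    [0.10, 0.20, 0.40],
    [0.30, 0.30, 0.90],
    [0.05, 0.15, 0.10],
    [0.50, 0.20, 0.10],
    [0.20, 0.40, 0.90],
]
toy_labels = ["control", "control", "control", "stage I", "stage II"]

processed = [snv(spectrum) for spectrum in toy_spectra]
balanced_spectra, balanced_labels = random_undersample(processed, toy_labels)
```

SNV is applied to each spectrum independently, so it can be run before or after balancing without changing the result for any retained sample.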
In the oversampling experiment, the GS-SVM classifier obtained the highest average classification accuracy, 95.45%; the diagnostic accuracy was 100% for non-cancer control subjects, 90% for stage I, 84.62% for stage II, and 100% for stage III & IV breast cancer. In the undersampling experiment, the GA-SVM model obtained the highest average classification accuracy, 100%, with a diagnostic accuracy of 100% for each of the four classes. The results show that FT-IR spectroscopy combined with powerful classification algorithms has great potential for distinguishing patients at different stages of breast cancer from non-cancer control subjects. In addition, this research provides a reference for future multiclass studies of cervical cancer, ovarian cancer, and other cancers with high incidence in women using serum FT-IR spectroscopy.
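The abstract does not state how the per-class diagnostic accuracies were computed. The sketch below assumes one common reading: the fraction of samples of each true class that the classifier labels correctly, alongside the overall fraction of correct predictions. The function names and the toy labels are hypothetical and do not reproduce the study's data.

```python
from collections import Counter


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of samples in each true class that were predicted correctly."""
    totals = Counter(true_labels)
    correct = Counter(
        true for true, predicted in zip(true_labels, predicted_labels)
        if true == predicted
    )
    return {label: correct[label] / totals[label] for label in totals}


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of all samples that were predicted correctly."""
    hits = sum(
        true == predicted
        for true, predicted in zip(true_labels, predicted_labels)
    )
    return hits / len(true_labels)


# Hypothetical test-set labels, for illustration only.
y_true = ["control", "control", "stage I", "stage I", "stage II", "stage III & IV"]
y_pred = ["control", "control", "stage I", "stage II", "stage II", "stage III & IV"]

class_scores = per_class_accuracy(y_true, y_pred)
total_score = overall_accuracy(y_true, y_pred)
```

Note that the overall accuracy weights classes by their size in the test set, so it generally differs from the unweighted mean of the per-class values.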