Abstract

Robust classification algorithms for high-dimensional, small-sample datasets are valuable in practical applications. Faced with the infrared spectroscopic dataset with 568 samples and 3448 wavelengths (features) to identify the origins of Chinese medicinal materials, this paper proposed a novel embedded multiclassification algorithm, ITabNet, derived from the framework of TabNet. Firstly, a refined data pre-processing (DP) mechanism was designed to efficiently find the best adaptive one among 50 DP methods with the help of Support Vector Machine (SVM). Following this, an innovative focal loss function was designed and joined with a cross-validation experiment strategy to mitigate the impact of sample imbalance on algorithm. Detailed investigations on ITabNet were conducted, including comparisons of ITabNet with SVM for the conditions of DP and Non-DP, GPU and CPU computer settings, as well as ITabNet against XGBT (Extreme Gradient Boosting). The numerical results demonstrate that ITabNet can significantly improve the effectiveness of prediction. The best accuracy score is 1.0000, and the best Area Under the Curve (AUC) score is 1.0000. Suggestions on how to use models effectively were given. Furthermore, ITabNet shows the potential to apply the analysis of medicinal efficacy and chemical composition of medicinal materials. The paper also provides ideas for multi-classification modeling data with small sample size and high-dimensional feature.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call