Terahertz time-domain spectroscopy (THz-TDS) has been widely used for food and drug identification. The discriminative information in a THz spectrum is usually not spread over the whole spectral band but concentrated in one or a few narrow intervals, so feature selection is indispensable in THz-based substance identification. However, most THz-based identification methods empirically truncate the THz absorption coefficients to the low-frequency band for analysis. To find the important intervals of the THz spectra adaptively, an interval-based sparse ensemble multi-class classifier (ISEMCC) for THz spectral data classification is proposed. In ISEMCC, the THz spectra are first divided into several small intervals by a sliding window. The training data in each interval are then extracted to train a set of base classifiers. Finally, a robust final classifier is obtained as a nonnegative sparse combination of the trained base classifiers. Using the ℓ1-norm, two objective functions are established, one based on the Mean Square Error (MSE) and one on the Cross Entropy (CE). For these two objective functions, two iterative algorithms are developed, based on the Alternating Direction Method of Multipliers (ADMM) and Gradient Descent (GD), respectively. ISEMCC thus transforms the problem of interval feature selection and decision-level fusion into a nonnegative sparse optimization problem, in which the sparsity constraint ensures that only a few important spectral segments are selected. To verify the performance of the proposed algorithm, comparative experiments on identifying the origin of Bupleurum and the harvesting year of Tangerine peel are carried out. The base classifiers used by ISEMCC are Support Vector Machine (SVM) and Decision Tree (DT). The experimental results demonstrate that the proposed algorithm outperforms six typical classifiers, namely Random Forest (RF), AdaBoost, RUSBoost, ExtraTree, and the two base classifiers, in terms of classification accuracy.
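The pipeline described above (sliding-window intervals, one base classifier per interval, nonnegative sparse fusion) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sliding_intervals` and `fit_sparse_weights` are hypothetical, the base-classifier scores are hand-made toy values rather than SVM or DT outputs, and the MSE objective is solved here with a plain projected proximal-gradient iteration instead of the ADMM algorithm mentioned in the abstract. The step size 1/m assumes every score row has Euclidean norm at most 1 (for example one-hot or probability vectors).

```python
def sliding_intervals(n_points, width, step):
    """Return (start, stop) index pairs of windows slid across a spectrum."""
    return [(s, s + width) for s in range(0, n_points - width + 1, step)]


def fit_sparse_weights(preds, targets, lam=0.05, iters=500):
    """Nonnegative l1-penalised least squares (assumed MSE-style objective).

    preds[k][i][c] : score of base classifier k for sample i and class c
    targets[i][c]  : one-hot label of sample i
    Minimises (1/2n) * ||targets - sum_k w_k * preds[k]||^2 + lam * sum_k w_k
    subject to w_k >= 0.
    """
    m, n, n_cls = len(preds), len(targets), len(targets[0])
    lr = 1.0 / m                      # safe step when score rows have norm <= 1
    w = [1.0 / m] * m                 # start from a uniform combination
    for _ in range(iters):
        # Residual of the fused scores under the current weights.
        resid = [[sum(w[k] * preds[k][i][c] for k in range(m)) - targets[i][c]
                  for c in range(n_cls)] for i in range(n)]
        new_w = []
        for k in range(m):
            # Gradient of the squared-error term with respect to w_k.
            grad = sum(resid[i][c] * preds[k][i][c]
                       for i in range(n) for c in range(n_cls)) / n
            # Gradient step, l1 shrinkage, and projection onto w_k >= 0.
            new_w.append(max(0.0, w[k] - lr * (grad + lam)))
        w = new_w
    return w


# Toy example: 4 samples, 2 classes, 3 interval classifiers (made-up scores).
targets = [[1, 0], [0, 1], [1, 0], [0, 1]]
preds = [
    [[1, 0], [0, 1], [1, 0], [0, 1]],               # informative interval
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],  # uninformative interval
    [[1, 0], [1, 0], [0, 1], [0, 1]],               # right on half the samples
]
weights = fit_sparse_weights(preds, targets)
print(sliding_intervals(10, 4, 3))
print([round(x, 3) for x in weights])
```

In this toy setting only the informative interval keeps a nonzero weight, which is the selection effect the sparsity constraint is meant to produce; intervals whose weight is driven to zero are simply dropped from the final classifier.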