This study proposes depression detection systems based on the i-vector framework for classifying speakers as depressed or healthy and for predicting depression levels according to the Beck Depression Inventory-II (BDI-II). Linear and non-linear speech features are investigated as front-end features for i-vector extraction. To exploit their complementary effects, i-vector systems based on linear and non-linear features are combined through decision-level fusion. Variability compensation techniques such as Linear Discriminant Analysis (LDA) and Within-Class Covariance Normalization (WCCN) are widely used to reduce unwanted variabilities, but a technique more generalizable than LDA is required when only limited training data are available. To address this problem, we employ Support Vector Discriminant Analysis (SVDA), which uses class boundaries to find discriminatory directions. Experiments conducted on the 2014 Audio-Visual Emotion Challenge and Workshop (AVEC 2014) depression database show that SVDA yields an accuracy improvement of up to 15.15% over uncompensated i-vectors. In all cases, the experimental results confirm that decision-level fusion of i-vector systems based on three feature sets, TEO-CB-Auto-Env+Δ, Glottal+Δ, and MFCC+Δ+ΔΔ, achieves the best results. This fusion significantly improves classification performance, yielding an accuracy of 90%. Combining the SVDA-transformed BDI-II score prediction systems based on these three feature sets achieved an RMSE of 8.899 and an MAE of 6.991 on the test partition, improvements of 29.18% and 30.34%, respectively, over the baseline system. Furthermore, the proposed combination outperforms other audio-based studies in the literature that use the AVEC 2014 database.
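The variability compensation and fusion steps mentioned above can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's implementation: the function names, synthetic i-vectors, and fusion weights are our own assumptions, and only the standard WCCN whitening and score-level fusion ideas are shown.

```python
import numpy as np

def wccn_transform(ivectors, labels):
    """Within-Class Covariance Normalization (WCCN).

    Computes the pooled within-class covariance W of the i-vectors and
    returns the vectors projected by B, where B is the Cholesky factor
    of W^{-1} (so the transformed within-class covariance is identity).
    Hypothetical helper; the paper's exact pipeline may differ.
    """
    ivectors = np.asarray(ivectors, dtype=float)
    labels = np.asarray(labels)
    d = ivectors.shape[1]
    W = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = ivectors[labels == c]
        # biased per-class covariance, weighted by class size
        W += np.cov(Xc, rowvar=False, bias=True) * len(Xc)
    W /= len(ivectors)
    # B satisfies B @ B.T = W^{-1}; rows x are mapped to x @ B (= B.T x)
    B = np.linalg.cholesky(np.linalg.inv(W))
    return ivectors @ B

def fuse_decisions(score_lists, weights):
    """Decision-level fusion as a weighted sum of per-system scores.

    `score_lists` holds one score array per subsystem (e.g. the three
    feature-set systems); `weights` are illustrative fusion weights.
    """
    return sum(w * np.asarray(s, dtype=float)
               for w, s in zip(weights, score_lists))
```

After `wccn_transform`, the empirical within-class covariance of the output is the identity matrix, which is the property WCCN is designed to enforce before scoring.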