Abstract

Depression has become a serious mental health issue worldwide, particularly with the rise of the global pandemic. Identifying depression in an individual from short texts shared on social media is a challenging task. The present work aims to select the optimal feature subset for classifying short texts for depression detection. Feature selection can eliminate redundant and noisy features in high-dimensional datasets with small sample sizes, which helps avoid the “curse of dimensionality” and improves the effectiveness of classification algorithms. However, current feature selection methods often focus on optimizing classification or clustering performance while neglecting the stability of the selected features. This can lead to unstable results and make it difficult to identify meaningful and interpretable features. This paper introduces a novel embedded feature selection approach, Statistical Relevance Class Frequency based on the Whale Optimization Algorithm (SRCF-WOA), for selecting feature subsets from short texts in social media. The proposed methodology extracts both unigram features and composite features to capture semantic and structural information. The χ².rcf (chi-squared relevance class frequency) filter approach is applied to rank the extracted features by importance. WOA is then adapted to retrieve an optimal, low-dimensional subset of features, drawing on its strong exploration and exploitation capabilities. The evaluation uses four benchmark short-text datasets and two classifiers. The comparison shows that the proposed embedded feature selection method outperforms other algorithms in terms of accuracy and Fβ scores (β = 0.5, 1, and 2). A sensitivity analysis is carried out to check the robustness and stability of the proposed method.
The findings indicate that SRCF-WOA surpasses other methods on the majority of datasets, achieving the highest classification accuracy while using the fewest features. The statistical significance of these findings is further supported by the Analysis of Variance (ANOVA) F-test. Moreover, the proposed method strikes the best balance between classification accuracy and feature stability.
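As a rough illustration of two ingredients named above, the sketch below ranks unigram features with a plain chi-squared statistic and computes the Fβ score from precision and recall. It is a minimal sketch under stated assumptions, not the paper's method: it omits the relevance class frequency weighting of χ².rcf and the WOA search entirely, and the function names `chi2_scores` and `f_beta` and the toy sentences are hypothetical.

```python
from collections import Counter

def chi2_scores(docs, labels):
    """Plain chi-squared score of each unigram against a binary label (0 or 1).

    Assumption: whitespace tokenisation and document-level presence/absence.
    This is NOT the chi-squared.rcf filter from the abstract, only the
    standard 2x2 chi-squared statistic.
    """
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    df_pos, df_neg = Counter(), Counter()
    for text, y in zip(docs, labels):
        for term in set(text.lower().split()):
            (df_pos if y == 1 else df_neg)[term] += 1

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # positive documents containing the term
        b = df_neg[term]   # negative documents containing the term
        c = n_pos - a      # positive documents without the term
        d = n_neg - b      # negative documents without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# Hypothetical toy data: label 1 marks a text treated as depression-related.
docs = [
    "i feel so hopeless today",
    "feel great after the run",
    "hopeless and tired again",
    "great day with friends",
]
labels = [1, 0, 1, 0]

scores = chi2_scores(docs, labels)
ranked = sorted(scores.items(), key=lambda kv: -kv[1])
```

In this toy example, terms that occur in only one class ("hopeless", "great") receive the highest scores, while a term spread evenly across both classes ("feel") scores zero. A wrapper or metaheuristic search such as WOA would then operate on the top-ranked features.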
