Abstract

The use of text data has grown rapidly, and with it the high dimensionality of text data has become a prominent problem. This study therefore assessed the application of filter feature selection techniques to text data. The selected filter methods ranked the features from highest to lowest in each generated feature subset, and a new feature subset was then obtained using the proposed aggregation method. The results show that the accuracy of Information Gain is 1.12 percentage points higher than that of Chi-square. Classification accuracy improved when the features were aggregated: it is 0.93 percentage points higher than with Information Gain and 2.05 percentage points higher than with Chi-square. For Precision, the difference relative to aggregation is 1.41 percentage points for Information Gain and a substantial 11.64 percentage points for Chi-square. For Recall, there is an improvement of 5.54 percentage points over Information Gain and 3.03 percentage points over Chi-square. The F1 score for aggregation, however, is quite low, which may indicate that the classifier has problems with false positives or false negatives. The classifier should therefore be examined with a confusion matrix, or the dataset should be inspected; neither was done in this experiment. Dataset imbalance was also not addressed in this study. Future work should address the class imbalance in the dataset, compare the performance of other filter methods, and evaluate other classifiers that support multiclass tasks to determine which is suitable for multiclass text classification.
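As a rough illustration of the ranking and aggregation steps summarized above, the following minimal Python sketch scores terms with Chi-square and Information Gain on a tiny invented corpus and then combines the two rankings. The abstract does not describe how the proposed method aggregates features, so the union of the two top-k lists used here is purely an assumption, as are the toy documents and all function names.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration): (terms in document, class label).
docs = [
    ({"goal", "match", "team"}, "sport"),
    ({"goal", "team", "win"}, "sport"),
    ({"match", "win", "team"}, "sport"),
    ({"vote", "party", "win"}, "politics"),
    ({"vote", "law", "party"}, "politics"),
    ({"law", "team", "vote"}, "politics"),
]

vocab = sorted(set().union(*(terms for terms, _ in docs)))
labels = sorted({label for _, label in docs})
n = len(docs)


def contingency(term):
    """Count documents by (term present?, class label)."""
    counts = Counter()
    for terms, label in docs:
        counts[(term in terms, label)] += 1
    return counts


def chi_square(term):
    """Pearson chi-square statistic between term presence and class."""
    counts = contingency(term)
    score = 0.0
    for present in (True, False):
        row = sum(counts[(present, c)] for c in labels)
        for c in labels:
            col = sum(counts[(p, c)] for p in (True, False))
            expected = row * col / n
            if expected > 0:
                score += (counts[(present, c)] - expected) ** 2 / expected
    return score


def entropy(class_counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(class_counts)
    return -sum((k / total) * math.log2(k / total) for k in class_counts if k > 0)


def information_gain(term):
    """Class entropy minus class entropy conditioned on term presence."""
    counts = contingency(term)
    prior = entropy([sum(counts[(p, c)] for p in (True, False)) for c in labels])
    conditional = 0.0
    for present in (True, False):
        row = [counts[(present, c)] for c in labels]
        if sum(row) > 0:
            conditional += sum(row) / n * entropy(row)
    return prior - conditional


def top_k(score_fn, k):
    """Rank terms from highest to lowest score; ties broken alphabetically."""
    return sorted(vocab, key=lambda t: (-score_fn(t), t))[:k]


def aggregate(k):
    # ASSUMPTION: aggregation is the union of the two top-k lists.
    # The study's actual aggregation rule is not given in the abstract.
    return sorted(set(top_k(chi_square, k)) | set(top_k(information_gain, k)))


print("Chi-square top 3:      ", top_k(chi_square, 3))
print("Information Gain top 3:", top_k(information_gain, 3))
print("Aggregated subset:     ", aggregate(3))
```

On a corpus this small the two filters may rank terms almost identically; on real text data the rankings typically diverge, which is where combining them could change the selected subset.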
