Abstract

High dimensionality of the feature space is a well-known problem in text classification and web mining, caused mainly by the large vocabulary of web documents. Several methods have been applied over the years to select the most useful and important features; however, their performance can still be improved in aspects such as computational cost and accuracy. This research presents an enhanced cosine similarity-based hybridization of two efficient feature selection methods for higher classification performance. The reduced feature sets are generated individually using Random Projection (RP) and Principal Component Analysis (PCA), then hybridized based on the cosine similarity values between the feature vectors. The performance of the proposed method, in terms of accuracy and F-measure, was tested on a dataset of web pages using several term weighting schemes. Compared with related methods, the proposed method achieves significantly higher accuracy and F-measure with a smaller feature set.

Index Terms: Cosine similarity, Dimensionality Reduction, Feature selection, PCA, Random Projection.
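The abstract does not state the exact hybridization rule, so the following is only a minimal sketch of one way two reduced feature sets could be merged using cosine similarity. The function names, the 0.9 threshold, and the policy of keeping all PCA features and adding only dissimilar RP features are illustrative assumptions, not the authors' method.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybridize(pca_features: List[Vector],
              rp_features: List[Vector],
              threshold: float = 0.9) -> List[List[float]]:
    """Merge two reduced feature sets, skipping near-duplicate directions.

    Assumed rule (not taken from the paper): keep every PCA feature, then add
    an RP feature only if its absolute cosine similarity to every feature
    kept so far is below `threshold`.
    """
    merged = [list(f) for f in pca_features]
    for candidate in rp_features:
        if all(abs(cosine_similarity(candidate, kept)) < threshold
               for kept in merged):
            merged.append(list(candidate))
    return merged


# Toy data: each feature vector holds one value per document (4 documents).
pca_features = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
rp_features = [
    [2.0, 0.0, 2.0, 0.0],    # parallel to the first PCA feature: redundant
    [1.0, 1.0, -1.0, -1.0],  # orthogonal to both PCA features: kept
]
hybrid = hybridize(pca_features, rp_features)
print(len(hybrid))  # 3
```

In this toy run the first RP feature is dropped because it points in the same direction as an existing PCA feature, while the second adds a new direction and is retained. In practice the two input sets would come from actual RP and PCA transforms of a term-weighted document matrix.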
