Content Classification based-on Latent Semantic Analysis and Support Vector Machine (LSA-SVM)

Gita Indah Marthasari, Nur Hayatin, Maulidya Yuniarti

doi:10.26623/transformatika.v19i2.2745

Open Access

Abstract

The diversity of web content can be harmful when it reaches the wrong audience. Almost half of internet users are children. It is therefore important to classify web pages to determine which are suitable for children and which are not. One method that can be used is the Support Vector Machine (SVM) algorithm. SVM is a binary classifier that works by finding the best hyperplane separating two classes. To obtain better classification accuracy, SVM is combined with the Latent Semantic Analysis (LSA) algorithm. The data used in this study were taken from the DMOZ web data, which had already been classified into two categories. The data are first pre-processed, and features are then extracted using LSA, which captures the semantic similarity of words and text contained in web pages. The extracted features are then classified using an SVM with an RBF kernel. In testing, the method achieved a classification accuracy of 64%.
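
The pipeline described in the abstract (pre-processing, LSA feature extraction, SVM with an RBF kernel) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: it assumes scikit-learn, approximates LSA with TF-IDF followed by truncated SVD, and uses a tiny hypothetical corpus in place of the DMOZ data. The function name `build_lsa_svm`, the labels (1 = suitable for children, 0 = not suitable), and the parameter values are all assumptions made for the example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the DMOZ pages (not the study's data).
docs = [
    "colorful cartoon games and puzzles for kids",
    "learn counting with friendly animal stories",
    "school homework help and science experiments for children",
    "fairy tales and bedtime stories for young readers",
    "online casino poker betting and gambling bonuses",
    "graphic violence and horror movie gore scenes",
    "adult dating chat rooms and explicit content",
    "buy cigarettes alcohol and betting tips online",
]
# Assumed label encoding: 1 = suitable for children, 0 = not suitable.
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def build_lsa_svm(n_components=2, random_state=0):
    """Build a TF-IDF -> truncated SVD (LSA) -> RBF-kernel SVM pipeline."""
    return make_pipeline(
        # Pre-processing: tokenize, lowercase, drop English stop words.
        TfidfVectorizer(stop_words="english"),
        # LSA: project the term-document matrix onto a few latent dimensions.
        TruncatedSVD(n_components=n_components, random_state=random_state),
        # Binary classification with an RBF kernel.
        SVC(kernel="rbf", gamma="scale"),
    )


model = build_lsa_svm()
model.fit(docs, labels)

new_pages = ["bedtime stories and puzzles for kids", "poker betting bonuses"]
predictions = model.predict(new_pages)
print(predictions)
```

On real data the number of latent dimensions and the SVM hyperparameters would need tuning, and accuracy should be measured on a held-out test set rather than on the training pages.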
