Feature Engineering with Word2vec on Text Classification Using The K-Nearest Neighbor Algorithm

Syopiansyah Jaya Putra, Muhamad Nur Gunawan, Arief Akbar Hidayat

doi:10.1109/citsm56380.2022.9935873

Abstract

Text feature extraction is the process of converting unstructured text data into a structured form that machine learning algorithms can process. One of the most commonly used text feature extraction techniques is TF-IDF. This technique can produce high-dimensional data, which lengthens computation time and affects accuracy. This study compares feature extraction with word2vec and with TF-IDF. It uses a four-step data exploration approach with a text classification process modeled using the K-Nearest Neighbor (KNN) algorithm. The results show that the highest accuracy of TF-IDF with the KNN algorithm was 73% in the 7:3 scenario with 8133 features. The highest accuracy of word2vec with the KNN algorithm was 74% in the 9:1 scenario with 300 features. Word2vec therefore reaches accuracy comparable to TF-IDF while producing data with far fewer dimensions. This study shows that feature extraction with word2vec can be used in machine learning research, not only in deep learning. It can also serve as a comparison of classification performance across different feature extraction methods, which can later be applied in web or mobile apps.
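
The dimensionality contrast described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy corpus, the hypothetical 2-dimensional `toy_table` of word vectors (standing in for a trained 300-dimensional word2vec model), the choice to average word vectors into a document vector, the smoothed IDF formula, and the cosine-similarity KNN are all assumptions made for illustration.

```python
import math
from collections import Counter

def fit_tfidf(docs):
    """Learn a sorted vocabulary and per-term IDF weights from whitespace-tokenized documents."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    df = Counter(t for toks in tokenized for t in set(toks))
    # IDF variant log(N / df) + 1 is an assumption; other smoothing choices exist.
    idf = {t: math.log(len(docs) / df[t]) + 1.0 for t in vocab}
    return vocab, idf

def tfidf_vector(text, vocab, idf):
    """One dimension per vocabulary term, so the vector grows with the corpus vocabulary."""
    toks = text.lower().split()
    tf = Counter(toks)
    return [tf[t] / len(toks) * idf[t] for t in vocab]

def mean_embedding(text, table, dim):
    """Average the word vectors of known tokens; the length is fixed at dim."""
    known = [table[t] for t in text.lower().split() if t in table]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train_vecs, train_labels, query, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(pair[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy corpus and labels (hypothetical, for illustration only).
docs = ["good great film", "great acting good plot",
        "bad boring film", "boring plot bad acting"]
labels = ["pos", "pos", "neg", "neg"]

# Hypothetical 2-dimensional word vectors; a real word2vec model would be trained on a corpus.
toy_table = {
    "good": [1.0, 0.0], "great": [0.9, 0.1],
    "bad": [0.0, 1.0], "boring": [0.1, 0.9],
    "film": [0.5, 0.5], "acting": [0.5, 0.5], "plot": [0.5, 0.5],
}

vocab, idf = fit_tfidf(docs)
X_tfidf = [tfidf_vector(d, vocab, idf) for d in docs]
X_embed = [mean_embedding(d, toy_table, 2) for d in docs]

query = "great film good"
pred_tfidf = knn_predict(X_tfidf, labels, tfidf_vector(query, vocab, idf), k=3)
pred_embed = knn_predict(X_embed, labels, mean_embedding(query, toy_table, 2), k=3)

print(len(X_tfidf[0]), len(X_embed[0]))  # TF-IDF width tracks vocabulary size; embedding width is fixed
print(pred_tfidf, pred_embed)
```

In this sketch the TF-IDF vectors have one dimension per distinct term in the corpus, while the averaged embeddings keep a fixed width regardless of vocabulary size. The same KNN routine is applied to both representations, which mirrors the kind of comparison the abstract describes.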

Full Text