Unsupervised Document Multi-Category Weight Extraction Based on Word Embedding and Word Network Analysis: A Case Study on Mobile Phone Reviews

Jaeyun Jeong, Kyoung Hyun Mo, Czang Yeob Kim, Seungwan Seo, Haedong Kim, Pilsung Kang

doi:10.7232/jkiie.2018.44.6.442

Abstract

With the growing volume of online documents, there is increasing demand for text categorization, which assigns documents to predefined categories. Many approaches to this problem rely on supervised machine learning, which cannot be applied to unlabeled data. However, a large number of documents, such as online cell phone reviews, carry no category information, and their key categories are not predefined. To address these problems, we propose an unsupervised document multi-labeling method based on word embedding and word network analysis. After embedding words in a lower-dimensional space using the Word2Vec technique, we generate a weight matrix by calculating similarities between words. We build a word network from this matrix and extract the key categories from the network. Using the key category-weight matrix and a co-occurrence matrix, we generate a document-category score matrix. To verify the proposed method, we collected 298,206 cell phone reviews from four review websites and compared the results of the proposed method with labeled documents from a human cognitive perspective.
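
The pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors stand in for trained Word2Vec embeddings, cosine similarity is assumed as the similarity measure, weighted degree is assumed as the criterion for choosing key categories, and the "co-occurrence matrix" is assumed to be a document-by-word count matrix. All function names and values are hypothetical.

```python
import numpy as np

# Toy stand-in for Word2Vec output: one row per vocabulary word (made-up values).
vocab = ["battery", "charge", "screen", "display", "camera", "photo"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.1],
    [0.2, 0.8, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.2, 0.8],
])

def cosine_similarity_matrix(x):
    """Word-by-word weight matrix (assumption: cosine similarity)."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T

def build_network(sim, threshold):
    """Weighted adjacency matrix: keep edges at or above the threshold, no self-loops."""
    adj = np.where(sim >= threshold, sim, 0.0)
    np.fill_diagonal(adj, 0.0)
    return adj

def pick_key_categories(adj, k):
    """Indices of the k words with the largest weighted degree (assumed criterion)."""
    strength = adj.sum(axis=1)
    return np.argsort(-strength)[:k]

def document_category_scores(counts, sim, category_idx):
    """Document-by-category scores, normalized so each non-empty row sums to 1."""
    category_weights = sim[category_idx]          # shape: (categories, words)
    raw = counts @ category_weights.T             # shape: (documents, categories)
    totals = raw.sum(axis=1, keepdims=True)
    return np.divide(raw, totals, out=np.zeros_like(raw), where=totals > 0)

# Assumed document-by-word counts for two toy reviews.
counts = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0, 0.0],   # mostly about the battery
    [0.0, 0.0, 0.0, 0.0, 1.0, 3.0],   # mostly about the camera
])

sim = cosine_similarity_matrix(vectors)
adj = build_network(sim, threshold=0.9)
key_idx = pick_key_categories(adj, k=3)
scores = document_category_scores(counts, sim, key_idx)
```

Each row of `scores` gives one document's weights over the selected key categories, so a review can belong to several categories to different degrees rather than receiving a single label.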
