A Case Study on Development of Fraud Dictionary about Course Reviews for Quality Control of Vocational Training

Sun Jeong Jeong, Min He Shin, Kyung Hwa Rim, Eun Hye Lee

doi:10.37210/jver.2017.36.3.67

Abstract

The purpose of this study is to develop a fraud dictionary that automatically detects course reviews reporting fraudulent training practices, in order to manage the quality of government-supported vocational training. Trainees write these reviews after completing a training program. The dictionary was developed from course reviews of training programs supported by the Ministry of Employment and Labor in 2015 (January to December 2015, 118,879 reviews) and was then applied to the most recent year of course reviews (July 2015 to June 2016, 88,816 reviews) to extract reviews reporting fraud.

The main findings are as follows. First, we developed the fraud dictionary through a six-step procedure, applying semantic analysis and opinion mining to the trainees' course reviews, which are unstructured text data. The dictionary consists of 1,278 fraud words, classified by form and meaning into 68 minor categories, 28 intermediate categories, and 3 major categories, and we selected a model that assigns semantic weights according to the classification level. Across all 118,879 reviews, manual extraction identified 788 reviews (0.7%) as reporting fraud and took eight weeks, while the fraud dictionary identified 668 reviews (0.6%), a result similar to the manual one. Second, applying the fraud dictionary model to the most recent year of reviews initially flagged 6,413 reviews (7.2% of the total). After semantic analysis of these, 499 reviews (0.6%) were finally extracted as containing reports of fraud, a process that took one week.

The study is significant in three respects. First, for quality management of vocational training, it applied opinion mining to a large volume of course reviews, which are the subjective opinions of trainees who took part in training programs, to develop a fraud dictionary that classifies fraud-related opinions, and it linked the results to administrative measures. Second, manual extraction of fraud-related reviews took eight weeks, whereas applying the fraud dictionary took only one week, which greatly improved work efficiency. Third, performing semantic analysis before applying opinion mining minimized errors in extracting fraud words.
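To make the idea of a category-weighted dictionary concrete, the following is a minimal Python sketch, not the authors' model. The abstract does not give the actual dictionary entries, category names, weights, or decision threshold, so every term, category label, weight, and the `threshold` value below is a hypothetical placeholder. Weighting by major category is also an assumption; the abstract only says that weights depend on the classification level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One dictionary term with its three-level classification (all hypothetical)."""
    term: str
    major: str
    middle: str
    minor: str


# Hypothetical entries; the real dictionary holds 1,278 terms.
DICTIONARY = [
    Entry("attendance was faked", "attendance", "false records", "proxy check-in"),
    Entry("never held", "delivery", "missing sessions", "class not held"),
    Entry("refund was refused", "fees", "billing", "refund denied"),
]

# Hypothetical weights, assigned here per major category.
WEIGHTS = {"attendance": 3.0, "delivery": 2.0, "fees": 1.0}


def score_review(text, dictionary=DICTIONARY, weights=WEIGHTS):
    """Return (score, matched entries) using plain case-insensitive substring matching."""
    lowered = text.lower()
    hits = [entry for entry in dictionary if entry.term in lowered]
    score = sum(weights[entry.major] for entry in hits)
    return score, hits


def flag_reviews(reviews, threshold=2.0):
    """First-pass detection: keep reviews whose weighted score reaches the threshold."""
    flagged = []
    for review in reviews:
        score, hits = score_review(review)
        if score >= threshold:
            flagged.append((review, score, [hit.minor for hit in hits]))
    return flagged


if __name__ == "__main__":
    sample = [
        "The instructor was kind.",
        "Attendance was faked for absent students.",
        "My refund was refused.",
    ]
    for review, score, minors in flag_reviews(sample):
        print(score, minors, review)
```

In this sketch the flagged reviews correspond to the first-pass detection step, and a separate review of their meaning would still be needed to reach the final set. Substring matching is used only for brevity; Korean review text would likely need morphological analysis before dictionary lookup.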
