Semi-supervised learning approach for Indonesian Named Entity Recognition (NER) using co-training algorithm

Bayu Aryoyudanta, Indriana Hidayah, Teguh Bharata Adji

doi:10.1109/isitia.2016.7828624

Abstract

A central obstacle to applying machine learning to Indonesian Named Entity Recognition (NER) is the limited amount of labelled data available for training. Unlabelled data, by contrast, is widely available from many sources, which makes semi-supervised learning a natural fit for the problem. This research designs a semi-supervised learning model for Indonesian NER. A co-training algorithm exploits unlabelled data during learning to produce new labelled data, which can then be used to improve an NER classification system. Two kinds of data are used: Indonesian DBPedia data as labelled data, and news article text from Indonesian news sites (kompas.com, cnnindonesia.com, tempo.co, merdeka.com and viva.co.id) as unlabelled data. The unstructured text is pre-processed by sentence segmentation, tokenization, stemming, and PoS tagging. The output of this pre-processing is the named entities and their context, which serve as unlabelled data for the semi-supervised co-training process. SVM is used as the classification algorithm in this process, and the system is evaluated with 10-fold cross-validation. In testing, the NER system achieves a precision of 73.6%, a recall of 80.1%, and a mean F1 score of 76.5%.
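The co-training loop summarized in the abstract can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the paper: the two views are taken to be the entity tokens and the context tokens, the SVM is replaced by a toy token-vote classifier so that the example needs no external libraries, and the names `TokenVoteClassifier`, `co_train`, `rounds`, `per_round`, and `threshold` are hypothetical. The sample data is invented for illustration only.

```python
from collections import Counter, defaultdict


class TokenVoteClassifier:
    """Toy stand-in for the SVM (an assumption, not the paper's classifier).

    Each token votes for the labels it co-occurred with during training.
    """

    def fit(self, docs, labels):
        self.counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            for tok in tokens:
                self.counts[tok][label] += 1
        return self

    def predict(self, tokens):
        """Return (label, confidence); (None, 0.0) if no token is known."""
        votes = Counter()
        for tok in tokens:
            votes.update(self.counts.get(tok, {}))
        if not votes:
            return None, 0.0
        label, top = votes.most_common(1)[0]
        return label, top / sum(votes.values())


def co_train(labelled, unlabelled, rounds=3, per_round=2, threshold=0.6):
    """Grow the labelled set using two views of each example.

    labelled:   list of ((entity_tokens, context_tokens), label)
    unlabelled: list of (entity_tokens, context_tokens)
    Returns the enlarged labelled list and the examples still unlabelled.
    """
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(rounds):
        if not pool:
            break
        new = {}  # pool index -> label proposed in this round
        for view in (0, 1):  # 0 = entity view, 1 = context view
            clf = TokenVoteClassifier().fit(
                [x[view] for x, _ in labelled],
                [y for _, y in labelled],
            )
            scored = []
            for i, x in enumerate(pool):
                label, conf = clf.predict(x[view])
                if label is not None and conf >= threshold:
                    scored.append((conf, i, label))
            scored.sort(key=lambda s: -s[0])  # most confident first
            for _conf, i, label in scored[:per_round]:
                # Simplification: if the views disagree, the first one wins.
                new.setdefault(i, label)
        if not new:
            break  # neither view is confident about anything left
        labelled += [(pool[i], label) for i, label in new.items()]
        pool = [x for i, x in enumerate(pool) if i not in new]
    return labelled, pool


# Invented toy data: (entity tokens, context tokens)
seed = [
    ((["joko", "widodo"], ["presiden", "berkata"]), "PERSON"),
    ((["jakarta"], ["di", "kota"]), "LOCATION"),
]
raw = [
    (["joko", "susilo"], ["menteri", "menyatakan"]),
    (["bandung"], ["di", "kota", "besar"]),
    (["susilo"], ["menteri", "hadir"]),
]
grown, remaining = co_train(seed, raw)
for (entity, _context), label in grown[len(seed):]:
    print(" ".join(entity), label)
```

In this toy run the third example can only be labelled in the second round, after the first round has added an example that shares tokens with it, which is the mechanism by which co-training turns unlabelled text into new training data. A real system would substitute a proper classifier with calibrated confidence scores for the token-vote stand-in.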

Full Text