Abstract

Scientific information extraction is a crucial step towards understanding scientific publications. In this paper, we focus on scientific keyphrase extraction, which aims to identify keyphrases in scientific articles and classify them into predefined categories. We present a neural network based approach for this task, which employs a bidirectional long short-term memory (LSTM) network to represent the sentences in the article. On top of the bidirectional LSTM layer, a conditional random field (CRF) is used to predict the label sequence for the whole sentence. Since annotated data for supervised learning is expensive to obtain, we introduce a self-training method into our neural model to leverage unlabeled articles. Experimental results on the ScienceIE corpus and the ACL keyphrase corpus show that our neural model achieves promising performance without any hand-designed features or external knowledge resources. Furthermore, it efficiently incorporates unlabeled data and achieves competitive performance compared with previous state-of-the-art systems.
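
As a rough sketch of the architecture described above, the snippet below combines a bidirectional LSTM encoder with a CRF output layer for sentence-level sequence labeling. It assumes PyTorch and the third-party pytorch-crf package; the hyperparameters, helper names, and tag inventory are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal BiLSTM-CRF tagger sketch (PyTorch + pytorch-crf).
# Hyperparameters and method names are illustrative assumptions,
# not the authors' exact model configuration.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf


class BiLSTMCRFTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=200):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Bidirectional LSTM encodes each sentence; hidden_dim is split
        # between the forward and backward directions.
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        # Linear layer maps LSTM states to per-token tag scores (emissions).
        self.emissions = nn.Linear(hidden_dim, num_tags)
        # CRF layer scores whole label sequences, capturing tag transitions.
        self.crf = CRF(num_tags, batch_first=True)

    def _encode(self, token_ids):
        states, _ = self.lstm(self.embedding(token_ids))
        return self.emissions(states)

    def loss(self, token_ids, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF
        # (mask is a bool tensor marking real tokens vs. padding).
        return -self.crf(self._encode(token_ids), tags, mask=mask)

    def predict(self, token_ids, mask):
        # Viterbi decoding of the best tag sequence for each sentence.
        return self.crf.decode(self._encode(token_ids), mask=mask)
```

Training minimizes the negative log-likelihood returned by loss, and predict performs Viterbi decoding over the whole sentence, which is what lets the CRF layer enforce consistent label sequences rather than scoring each token independently.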

Highlights

  • With the explosive increase in scientific publications, it is important for users to better understand the key ideas of these articles

  • Compared with hand-designed features and traditional discrete feature representations, neural networks provide a different way to automatically learn dense feature representations for text, such as words, phrases and sentences. Our method follows this line and builds a neural model based on the bidirectional long short-term memory (LSTM) network and the conditional random field (CRF)

  • Scientific information extraction has attracted much attention in recent years, and it became the focus of SemEval 2017 Task 10

Summary

Introduction

With the explosive increase in scientific publications, it is important for users to better understand the key ideas of these articles. Scientific keyphrase identification and classification is motivated by the increasing demand for efficiently finding relevant scientific publications and automatically understanding their key information, and it has received much academic interest over the past years [2,3,4,5,6]. In this corpus, the keyphrases are annotated with three categories (DOMAIN, TECHNIQUE and FOCUS). Such annotated datasets allow us to employ supervised machine learning methods for scientific keyphrase extraction. Compared with hand-designed features and traditional discrete feature representations, neural networks provide a different way to automatically learn dense feature representations for text, such as words, phrases and sentences. Our method follows this line and builds a neural model based on the bidirectional long short-term memory (LSTM) network and the conditional random field (CRF). Standard evaluation demonstrates that our neural model achieves promising performance for scientific keyphrase extraction without any hand-designed features or external knowledge resources. With the self-training method, our model can efficiently utilize unlabeled data and achieve competitive performance compared with other state-of-the-art systems (see the sketch after this paragraph).
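
The self-training idea mentioned above can be summarized as follows: train on the labeled sentences, tag the unlabeled pool, and move confidently tagged sentences into the training set for the next round. In this sketch, fit, predict_with_confidence, the confidence threshold and the number of rounds are hypothetical placeholders, not the paper's exact procedure.

```python
# Generic self-training loop sketch. `fit` and `predict_with_confidence`
# are hypothetical callables supplied by the user; the threshold and the
# number of rounds are illustrative assumptions.
def self_train(fit, predict_with_confidence, labeled, unlabeled,
               rounds=5, threshold=0.95):
    train_set = list(labeled)
    for _ in range(rounds):
        model = fit(train_set)                 # retrain on current data
        remaining = []
        for sentence in unlabeled:
            tags, confidence = predict_with_confidence(model, sentence)
            if confidence >= threshold:
                # Add the confidently auto-labeled sentence as pseudo-gold data.
                train_set.append((sentence, tags))
            else:
                remaining.append(sentence)
        unlabeled = remaining
        if not unlabeled:                      # nothing left to pseudo-label
            break
    return fit(train_set)
```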

Related work
Methods
Parameters Θ Initialization
Experiments
Results
Conclusions