Abstract

Scientific information extraction is a crucial step towards understanding scientific publications. In this paper, we focus on scientific keyphrase extraction, which aims to identify keyphrases in scientific articles and classify them into predefined categories. We present a neural network based approach for this task, which employs a bidirectional long short-term memory (LSTM) network to represent the sentences in the article. On top of the bidirectional LSTM layer, a conditional random field (CRF) is used to predict the label sequence for the whole sentence. Since annotated data for supervised learning is expensive to obtain, we introduce a self-training method into our neural model to leverage unlabeled articles. Experimental results on the ScienceIE corpus and the ACL keyphrase corpus show that our neural model achieves promising performance without any hand-designed features or external knowledge resources. Furthermore, it efficiently incorporates unlabeled data and achieves competitive performance compared with previous state-of-the-art systems.
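
As a rough sketch of the architecture described above, the snippet below combines a bidirectional LSTM encoder with a CRF output layer for sentence-level sequence labeling. It assumes PyTorch and the third-party pytorch-crf package; the hyperparameters, helper names, and tag inventory are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal BiLSTM-CRF tagger sketch (PyTorch + pytorch-crf).
# Hyperparameters and method names are illustrative assumptions,
# not the authors' exact model configuration.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf


class BiLSTMCRFTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=200):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Bidirectional LSTM encodes each sentence; hidden_dim is split
        # between the forward and backward directions.
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        # Linear layer maps LSTM states to per-token tag scores (emissions).
        self.emissions = nn.Linear(hidden_dim, num_tags)
        # CRF layer scores whole label sequences, capturing tag transitions.
        self.crf = CRF(num_tags, batch_first=True)

    def _encode(self, token_ids):
        states, _ = self.lstm(self.embedding(token_ids))
        return self.emissions(states)

    def loss(self, token_ids, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF
        # (mask is a bool tensor marking real tokens vs. padding).
        return -self.crf(self._encode(token_ids), tags, mask=mask)

    def predict(self, token_ids, mask):
        # Viterbi decoding of the best tag sequence for each sentence.
        return self.crf.decode(self._encode(token_ids), mask=mask)
```

Training minimizes the negative log-likelihood returned by loss, and predict performs Viterbi decoding over the whole sentence, which is what lets the CRF layer enforce consistent label sequences rather than scoring each token independently.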

Highlights

  • With the explosive increase in scientific publications, it is important for users to better understand the key ideas of these articles

  • Compared with hand-designed features and traditional discrete feature representations, neural networks provide a different way to automatically learn dense feature representations for text, such as words, phrases and sentences. Our method follows this line and builds a neural model based on the bidirectional long short-term memory (LSTM) network and the conditional random field (CRF)

  • Scientific information extraction has attracted much attention in recent years, and it became the focus of SemEval 2017 Task 10

Summary

Introduction

With the explosive increase in scientific publications, it is important for users to better understand the key ideas of these articles. Scientific keyphrase identification and classification is motivated by the increasing demand for efficiently finding relevant scientific publications and automatically understanding their key information, and it has received much academic interest over the past years [2,3,4,5,6]. In this corpus, the keyphrases are annotated with three categories (DOMAIN, TECHNIQUE and FOCUS). Such annotated datasets allow us to employ supervised machine learning methods for scientific keyphrase extraction. Compared with hand-designed features and traditional discrete feature representations, neural networks provide a different way to automatically learn dense feature representations for text, such as words, phrases and sentences. Our method follows this line and builds a neural model based on the bidirectional long short-term memory (LSTM) network and the conditional random field (CRF). Standard evaluation demonstrates that our neural model achieves promising performance for scientific keyphrase extraction without any hand-designed features or external knowledge resources. With the self-training method, our model can efficiently utilize unlabeled data and achieve competitive performance compared with other state-of-the-art systems (see the sketch after this paragraph).
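
The self-training idea mentioned above can be summarized as follows: train on the labeled sentences, tag the unlabeled pool, and move confidently tagged sentences into the training set for the next round. In this sketch, fit, predict_with_confidence, the confidence threshold and the number of rounds are hypothetical placeholders, not the paper's exact procedure.

```python
# Generic self-training loop sketch. `fit` and `predict_with_confidence`
# are hypothetical callables supplied by the user; the threshold and the
# number of rounds are illustrative assumptions.
def self_train(fit, predict_with_confidence, labeled, unlabeled,
               rounds=5, threshold=0.95):
    train_set = list(labeled)
    for _ in range(rounds):
        model = fit(train_set)                 # retrain on current data
        remaining = []
        for sentence in unlabeled:
            tags, confidence = predict_with_confidence(model, sentence)
            if confidence >= threshold:
                # Add the confidently auto-labeled sentence as pseudo-gold data.
                train_set.append((sentence, tags))
            else:
                remaining.append(sentence)
        unlabeled = remaining
        if not unlabeled:                      # nothing left to pseudo-label
            break
    return fit(train_set)
```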

Related work
Methods
Parameters Θ Initialization
Experiments
Results
Conclusions