Abstract

Purpose
Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we combine the benefits of the sequence labeling formulation and pretrained language models to propose an automatic keyphrase extraction model for Chinese scientific text.

Design/methodology/approach
We regard AKE from Chinese text as a character-level sequence labeling task to avoid the segmentation errors of Chinese tokenizers and initialize our model with the pretrained language model BERT, released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset from the medical domain, which contains 100,000 abstracts as the training set, 6,000 abstracts as the development set and 3,094 abstracts as the test set. We use unsupervised keyphrase extraction methods, including term frequency (TF), TF-IDF and TextRank, and supervised machine learning methods, including Conditional Random Field (CRF), Bidirectional Long Short-Term Memory Network (BiLSTM) and BiLSTM-CRF, as baselines. Experiments are designed to compare word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models.

Findings
Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence labeling model based on BERT obtains an F1 score of 59.80%, a 9.64% absolute improvement.

Research limitations
We only consider the automatic keyphrase extraction task rather than keyphrase generation, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for dealing with nested keyphrases.

Practical implications
We make our character-level IOB format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available for the benefit of the research community at: https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction.

Originality/value
Through comparative experiments, our study demonstrates that the character-level formulation is more suitable for the Chinese automatic keyphrase extraction task under the general trend of pretrained language models. Our proposed dataset provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction.

Highlights

  • Automatic keyphrase extraction (AKE) is a task to extract important and topical phrases from the body of a document (Turney, 2000), which is the basis of information retrieval (Jones & Staveley, 1999), text summarization (Zhang, Zincir-Heywood, & Milios, 2004), text categorization (Hulth & Megyesi, 2006), opinion mining (Berend, 2011), and document indexing (Frank et al., 1999)

  • By designing comparative experiments, our study demonstrates that the character-level formulation is more suitable for the Chinese automatic keyphrase extraction task under the general trend of pretrained language models

  • The reason might be that BiLSTM-Conditional Random Field (CRF) is a more powerful model for capturing the contextual relationships among characters, compensating for the disadvantage that the character-level formulation does not model the relationships among words directly


Introduction

Automatic keyphrase extraction (AKE) is a task to extract important and topical phrases from the body of a document (Turney, 2000), which is the basis of information retrieval (Jones & Staveley, 1999), text summarization (Zhang, Zincir-Heywood, & Milios, 2004), text categorization (Hulth & Megyesi, 2006), opinion mining (Berend, 2011), and document indexing (Frank et al., 1999). It can help us quickly go through large amounts of textual information to find the main points of a text. Some researchers reformulate keyphrase extraction as a sequence labeling task and validate the effectiveness of this formulation.
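The character-level sequence labeling formulation can be made concrete with a short sketch. The snippet below is an illustrative example (not the paper's released code) of how a Chinese abstract and its gold keyphrases might be converted into one IOB tag per character, with no word segmentation step; the function name and the handling of overlaps are assumptions for illustration.

```python
def char_iob_labels(text, keyphrases):
    """Return one IOB tag ('B', 'I' or 'O') per character of `text`.

    Each occurrence of a keyphrase is labeled B (first character)
    followed by I (remaining characters); all other characters get O.
    Spans already claimed by an earlier keyphrase are skipped, since
    the dataset described here does not handle nested keyphrases.
    """
    labels = ["O"] * len(text)
    for kp in keyphrases:
        start = 0
        while True:
            i = text.find(kp, start)
            if i == -1:
                break
            if all(tag == "O" for tag in labels[i:i + len(kp)]):
                labels[i] = "B"
                for j in range(i + 1, i + len(kp)):
                    labels[j] = "I"
            start = i + len(kp)
    return labels

# Example: "高血压患者的治疗" ("treatment of hypertension patients")
# with keyphrases "高血压" (hypertension) and "治疗" (treatment).
text = "高血压患者的治疗"
print(list(zip(text, char_iob_labels(text, ["高血压", "治疗"]))))
```

Because labels attach to individual characters rather than segmented words, a tokenizer error can never split a keyphrase boundary, which is the motivation for the character-level formulation compared above.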

