Language model as an Annotator: Unsupervised context-aware quality phrase generation

Zhihao Zhang, Yuan Zuo, Chenghua Lin, Junjie Wu

doi:10.1016/j.knosys.2023.111175

Abstract

Phrase mining is a fundamental text mining task that aims to identify quality phrases from context. Nevertheless, the scarcity of large gold-labeled datasets, which demand substantial annotation effort from experts, renders this task exceptionally challenging. Furthermore, the emerging, infrequent, and domain-specific nature of quality phrases presents further challenges. Therefore, in this paper, we propose LMPhrase, a novel unsupervised context-aware quality phrase mining framework built upon large pre-trained language models (LMs). Specifically, we first mine quality phrases as silver labels by employing a parameter-free probing technique called Perturbed Masking on the pre-trained language model BERT (coined as Annotator). In contrast to typical statistics-based or distantly supervised methods, our silver labels, derived from large pre-trained language models, take into account the rich contextual information contained in the LMs. As a result, they bring distinct advantages in preserving the informativeness, concordance, and completeness of quality phrases. Secondly, training a discriminative span prediction model relies heavily on massive annotated data and risks overfitting to the silver labels. Alternatively, motivated by recent success in formulating language understanding problems such as named entity recognition and sentiment analysis as generation tasks, we formalize the phrase tagging task as a sequence generation problem by directly fine-tuning the Sequence-to-Sequence (Seq2Seq) pre-trained language model BART with silver labels (coined as Generator). Finally, we merge the quality phrases from both the Annotator and Generator as the final predictions, considering their complementary nature and distinct characteristics.
Extensive experiments show that LMPhrase consistently outperforms all existing competitors across two phrase mining tasks of different granularity, where each task is tested on datasets from two different domains. The promising results demonstrate the superiority of our framework built on pre-trained language models.
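
The data flow described above (silver labels from the Annotator, phrase tagging cast as text generation, and a final merge) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the target-sequence format or the merging rule, so the `SEP` separator, the helper names, the substring filter, and the plain set-union merge below are all assumptions made for illustration. No BERT or BART model is invoked; the phrase lists are stand-ins for model outputs.

```python
from typing import Iterable, List, Tuple

# Hypothetical separator between phrases in the generation target (an assumption).
SEP = " ; "


def build_seq2seq_pair(sentence: str, silver_phrases: Iterable[str]) -> Tuple[str, str]:
    """Turn a sentence and its silver-label phrases into a (source, target) text pair."""
    return sentence, SEP.join(silver_phrases)


def parse_generated(sentence: str, generated: str) -> List[str]:
    """Split generated text into phrases, keeping only those that occur in the sentence."""
    lowered = sentence.lower()
    phrases = []
    for chunk in generated.split(SEP.strip()):
        phrase = chunk.strip()
        if phrase and phrase.lower() in lowered:
            phrases.append(phrase)
    return phrases


def merge_phrases(annotator: Iterable[str], generator: Iterable[str]) -> List[str]:
    """Case-insensitive union of both phrase lists, keeping first occurrences in order."""
    seen = set()
    merged = []
    for phrase in list(annotator) + list(generator):
        key = phrase.lower()
        if key not in seen:
            seen.add(key)
            merged.append(phrase)
    return merged


# Toy example: the phrase lists stand in for Annotator and Generator outputs.
sentence = "Phrase mining relies on pre-trained language models."
silver = ["phrase mining", "language models"]
source, target = build_seq2seq_pair(sentence, silver)

generated = "pre-trained language models ; text mining"
generator_phrases = parse_generated(sentence, generated)
final_phrases = merge_phrases(silver, generator_phrases)
print(final_phrases)
```

In this toy run the generated phrase "text mining" is dropped because it does not occur in the sentence, and the merged result keeps the Annotator's phrases alongside the Generator's longer span. A real system might resolve overlapping spans differently.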

Journal: Knowledge-Based Systems
Publication Date: Nov 16, 2023
Citations: 3