Abstract
Obtaining high-quality embeddings for out-of-vocabulary (OOV) and low-frequency words is a long-standing challenge in natural language processing (NLP). To estimate such embeddings efficiently, we propose a new method that exploits dictionaries. More specifically, the explanatory note of a dictionary entry accurately describes the semantics of the corresponding word, so we adopt a sentence representation model to extract the semantics of the explanatory note and treat the resulting vector as the embedding of that word. We further design a new sentence representation model that encodes the explanatory notes of entries more effectively. Based on the assumption that higher-quality word embeddings lead to better downstream performance, we design an extrinsic experiment to evaluate the quality of the low-frequency words' embeddings. The experimental results show that the embeddings of low-frequency words estimated by our method are of higher quality. In addition, both intrinsic and extrinsic experiments show that our proposed sentence representation model represents the semantics of sentences well.
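The overall pipeline described in the abstract can be sketched as follows. This is an illustrative assumption of the workflow, not the authors' released code; `encode_sentence` is a placeholder for any sentence representation model (including the one proposed in the paper), and the example gloss is hypothetical.

```python
# Sketch: estimate embeddings for OOV / low-frequency words from their
# dictionary explanatory notes, using an arbitrary sentence encoder.
from typing import Callable, Dict
import numpy as np

def build_oov_embeddings(
    glosses: Dict[str, str],                       # headword -> explanatory note
    encode_sentence: Callable[[str], np.ndarray],  # any sentence representation model
) -> Dict[str, np.ndarray]:
    """Treat the encoded explanatory note as the embedding of the headword."""
    return {word: encode_sentence(note) for word, note in glosses.items()}

# Usage (hypothetical gloss and encoder):
# vectors = build_oov_embeddings(
#     {"petrichor": "a pleasant smell that follows the first rain after dry weather"},
#     encode_sentence=my_sentence_model.encode,
# )
```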
Highlights
We propose a new sentence representation model that differs from the current mainstream language models (LMs) such as BERT [1], XLNet [19], and GPT [2,20]
BERTmax means that the max-pooling of the encodings in BERT's last layer is treated as the representation of the input sentence, and BERTmean means that the mean-pooling of those encodings is treated as the representation. fasttextmax and fasttextmean are defined analogously to BERTmax and BERTmean (a minimal pooling sketch follows these highlights)
The overall performance of the BERT model is the worst, which shows that BERT needs further fine-tuning to perform well on downstream tasks. fasttextcls achieves performance second only to ours and surpasses LASER
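As referenced above, the BERTmax and BERTmean baselines can be reproduced roughly as below. This is a minimal sketch assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint, which are stand-ins for whatever BERT setup the paper actually used; the pooling here also includes the special [CLS]/[SEP] tokens for simplicity.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def bert_max_mean(sentence: str):
    """Return the BERTmax and BERTmean vectors for one sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        last_layer = bert(**inputs).last_hidden_state.squeeze(0)  # (seq_len, hidden)
    bert_max = last_layer.max(dim=0).values   # element-wise max over token encodings
    bert_mean = last_layer.mean(dim=0)        # element-wise mean over token encodings
    return bert_max, bert_mean

# Usage
v_max, v_mean = bert_max_mean("The cat sat on the mat.")
```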
Summary
Word embeddings contain semantics and other information learned from large-scale corpora. Recent works have demonstrated substantial gains on many natural language processing (NLP) tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task [1,2]. Many machine learning methods use pretrained word embeddings as input and achieve better performance on many NLP tasks [3], such as the well-known text classification [4,5,6] and neural machine translation [7,8,9], among others.