Abstract
Obtaining high-quality embeddings for out-of-vocabulary (OOV) and low-frequency words is a long-standing challenge in natural language processing (NLP). To estimate such embeddings efficiently, we propose a new method that exploits dictionaries. More specifically, the explanatory note of a dictionary entry accurately describes the semantics of the corresponding word, so we adopt a sentence representation model to extract the semantics of the explanatory note and treat the resulting vector as the embedding of that word. We further design a new sentence representation model that encodes the explanatory notes of entries more effectively. Based on the assumption that higher-quality word embeddings lead to better downstream performance, we design an extrinsic experiment to evaluate the quality of the low-frequency words' embeddings. The experimental results show that the embeddings of low-frequency words estimated by our method are of higher quality. In addition, both intrinsic and extrinsic experiments show that our proposed sentence representation model represents the semantics of sentences well.
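The overall pipeline described in the abstract can be sketched as follows. This is an illustrative assumption of the workflow, not the authors' released code; `encode_sentence` is a placeholder for any sentence representation model (including the one proposed in the paper), and the example gloss is hypothetical.

```python
# Sketch: estimate embeddings for OOV / low-frequency words from their
# dictionary explanatory notes, using an arbitrary sentence encoder.
from typing import Callable, Dict
import numpy as np

def build_oov_embeddings(
    glosses: Dict[str, str],                       # headword -> explanatory note
    encode_sentence: Callable[[str], np.ndarray],  # any sentence representation model
) -> Dict[str, np.ndarray]:
    """Treat the encoded explanatory note as the embedding of the headword."""
    return {word: encode_sentence(note) for word, note in glosses.items()}

# Usage (hypothetical gloss and encoder):
# vectors = build_oov_embeddings(
#     {"petrichor": "a pleasant smell that follows the first rain after dry weather"},
#     encode_sentence=my_sentence_model.encode,
# )
```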
Highlights
We propose a new sentence representation model that differs from the current mainstream language models (LMs) such as BERT [1], XLNet [19], and GPT [2,20]
BERTmax means that the max-pooling of the encodings in BERT's last layer is treated as the representation of the input sentence, and BERTmean means that the mean-pooling of those encodings is treated as the representation. fasttextmax and fasttextmean are defined analogously to BERTmax and BERTmean (a minimal pooling sketch follows these highlights)
The overall performance of the BERT model is the worst, which shows that BERT needs further fine-tuning to perform well on downstream tasks. fasttextcls achieves performance second only to ours and surpasses LASER
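As referenced above, the BERTmax and BERTmean baselines can be reproduced roughly as below. This is a minimal sketch assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint, which are stand-ins for whatever BERT setup the paper actually used; the pooling here also includes the special [CLS]/[SEP] tokens for simplicity.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def bert_max_mean(sentence: str):
    """Return the BERTmax and BERTmean vectors for one sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        last_layer = bert(**inputs).last_hidden_state.squeeze(0)  # (seq_len, hidden)
    bert_max = last_layer.max(dim=0).values   # element-wise max over token encodings
    bert_mean = last_layer.mean(dim=0)        # element-wise mean over token encodings
    return bert_max, bert_mean

# Usage
v_max, v_mean = bert_max_mean("The cat sat on the mat.")
```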
Summary
Word embeddings contain semantics and other information learned from large-scale corpora. Recent works have demonstrated substantial gains on many natural language processing (NLP) tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task [1,2]. Many machine learning methods use pretrained word embeddings as input and achieve better performance on many NLP tasks [3], such as the well-known text classification [4,5,6] and neural machine translation [7,8,9], among others.