Abstract

Pre-trained word embeddings are used in several downstream applications as well as for constructing representations for sentences, paragraphs and documents. Recently, there has been an emphasis on improving pre-trained word vectors through post-processing algorithms. One improvement area is reducing the dimensionality of word embeddings. Reducing the size of word embeddings can improve their utility in memory-constrained devices, benefiting several real-world applications. In this work, we present a novel technique that efficiently combines PCA-based dimensionality reduction with a recently proposed post-processing algorithm (Mu and Viswanath, 2018) to construct effective word embeddings of lower dimensions. Empirical evaluations on several benchmarks show that our algorithm efficiently reduces the embedding size while achieving similar or (more often) better performance than the original embeddings. We have released the source code along with this paper.

Highlights

  • Word embeddings such as GloVe (Pennington et al., 2014) and word2vec Skip-Gram (Mikolov et al., 2013), obtained from unlabeled text corpora, can represent words as distributed, dense, real-valued, low-dimensional vectors that geometrically capture the semantic ‘meaning’ of a word

  • Spearman’s rank correlation coefficient (ρ × 100) between the rankings produced by the word vectors and the human rankings is used for the evaluation (a minimal sketch of this evaluation follows the highlights list)

  • Evaluation Results: First, we evaluate our algorithm against the 3 baselines on the same embeddings; then, we evaluate it across word embeddings of different dimensions and different types

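The following minimal sketch illustrates the word-similarity evaluation mentioned above: cosine similarities between embedding pairs are ranked against human judgements with Spearman's ρ. The names `evaluate_similarity`, `embeddings`, `word_pairs` and `human_scores` are illustrative assumptions and are not taken from the released source code.

    # Illustrative sketch of the word-similarity evaluation described above.
    # Names such as `embeddings`, `word_pairs` and `human_scores` are assumptions.
    import numpy as np
    from scipy.stats import spearmanr

    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def evaluate_similarity(embeddings, word_pairs, human_scores):
        """Spearman's rho (x 100) between model similarities and human rankings."""
        model_scores = [cosine(embeddings[a], embeddings[b]) for a, b in word_pairs]
        rho, _ = spearmanr(model_scores, human_scores)
        return rho * 100
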

Summary

Introduction

Word embeddings such as GloVe (Pennington et al., 2014) and word2vec Skip-Gram (Mikolov et al., 2013), obtained from unlabeled text corpora, can represent words as distributed, dense, real-valued, low-dimensional vectors that geometrically capture the semantic ‘meaning’ of a word. These embeddings capture several linguistic regularities such as analogy relationships. A major issue with word embeddings is their size (Ling et al., 2016); e.g., loading a word embedding matrix of 2.5 M tokens takes up to 6 GB of memory (for 300-dimensional vectors, on a 64-bit system). In this work, we combine a simple dimensionality reduction technique, PCA, with the post-processing technique of Mu and Viswanath (2018), as discussed above.
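The 6 GB figure is consistent with storing 2.5 M vectors of 300 double-precision (8-byte) values each: 2.5 × 10^6 × 300 × 8 B = 6 × 10^9 B. The sketch below illustrates one plausible way of combining PCA with such a post-processing step (subtracting the mean and removing projections onto the top principal components, in the spirit of Mu and Viswanath, 2018). The ordering of the steps, the number of removed components and all names are assumptions made for illustration; the released source code is authoritative.

    # Hedged sketch: post-process, reduce with PCA, post-process again.
    # The exact pipeline and defaults (e.g. `n_removed`, `target_dim`) are
    # illustrative assumptions, not a reproduction of the released code.
    import numpy as np
    from sklearn.decomposition import PCA

    def post_process(X, n_removed=7):
        """Subtract the mean and remove projections on the top principal components."""
        X = X - X.mean(axis=0)
        top = PCA(n_components=n_removed).fit(X).components_  # shape (n_removed, dim)
        return X - X @ top.T @ top

    def reduce_embeddings(X, target_dim=150, n_removed=7):
        """One plausible ordering: post-process, PCA-reduce, post-process."""
        X = post_process(X, n_removed)
        X = PCA(n_components=target_dim).fit_transform(X)
        return post_process(X, n_removed)

Applied to 300-dimensional embeddings, a pipeline of this shape would yield 150-dimensional vectors at roughly half the original memory footprint.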
