Abstract

Learning by contrasting positive and negative samples is a general strategy adopted by many methods. Noise contrastive estimation (NCE) for word embeddings and translating embeddings for knowledge graphs are examples in NLP employing this approach. In this work, we view contrastive learning as an abstraction of all such methods and augment the negative sampler into a mixture distribution containing an adversarially learned sampler. The resulting adaptive sampler finds harder negative examples, which forces the main model to learn a better representation of the data. We evaluate our proposal on learning word embeddings, order embeddings and knowledge graph embeddings and observe both faster convergence and improved results on multiple metrics.
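
As a concrete reading of the mixture sampler described above, the following minimal sketch draws a negative from the fixed NCE noise distribution with probability λ and from the learned conditional sampler g_θ(y|x) otherwise. This is not the authors' code; `unigram_probs`, `generator`, and the default λ = 0.5 are illustrative assumptions.

```python
import numpy as np

def sample_negative(x, unigram_probs, generator, lam=0.5, rng=np.random):
    """Draw one negative y for input x from the mixture
    lam * p_nce(y) + (1 - lam) * g_theta(y | x)."""
    if rng.random() < lam:
        # Fixed, unconditional NCE noise (e.g. a unigram distribution over the vocabulary).
        return rng.choice(len(unigram_probs), p=unigram_probs)
    # Learned conditional sampler g_theta(y | x); `generator` is assumed to
    # return a probability vector over the vocabulary for the input x.
    cond_probs = generator(x)
    return rng.choice(len(cond_probs), p=cond_probs)
```

Mixing in the fixed NCE noise keeps some coverage of the whole vocabulary even if the learned sampler concentrates on a few hard candidates (see "Improving exploration in gθ by leveraging NCE samples" below).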

Highlights

  • Many models learn by contrasting losses on observed positive examples with those on some fictitious negative examples, trying to decrease some score on positive ones while increasing it on negative ones

  • To remedy the above-mentioned problem of a fixed unconditional negative sampler, we propose to augment it into a mixture, λ p_nce(y) + (1 − λ) g_θ(y|x), where g_θ is a conditional distribution with a learnable parameter θ and λ is a hyperparameter (a sketch of how g_θ can be trained follows this list)

  • We evaluate models trained from scratch as well as fine-tuned GloVe models (Pennington et al., 2014) on word similarity tasks, which consist of computing the similarity between pairs of words and comparing it against human-annotated similarity scores
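
Because negatives sampled from g_θ are discrete, gradients cannot flow through the samples into θ directly; the sections listed below ("Learning the generator", "Entropy and training stability", "Variance Reduction") address how the generator is trained. The following is a minimal policy-gradient (REINFORCE-style) sketch in PyTorch under that reading; using the main model's loss on the sampled negative as the reward, the `baseline` term, and the entropy weight are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def generator_step(gen_logits, hardness, optimizer,
                   baseline=0.0, entropy_weight=0.01):
    """One REINFORCE-style update of the negative-sample generator.

    gen_logits: unnormalized scores of g_theta(. | x), shape (vocab_size,),
        produced by the generator network for the current positive input x.
    hardness:   callable mapping a sampled negative index to a scalar reward,
        e.g. the main model's loss on the corrupted example (assumption).
    baseline:   variance-reduction term, e.g. a running mean of past rewards.
    """
    probs = F.softmax(gen_logits, dim=-1)
    y_neg = torch.multinomial(probs, num_samples=1)        # discrete negative sample
    log_prob = torch.log(probs[y_neg] + 1e-8)
    reward = hardness(y_neg).detach()                      # no gradient through the main model
    entropy = -(probs * torch.log(probs + 1e-8)).sum()     # keeps g_theta from collapsing
    # Maximize expected reward plus entropy == minimize the negated objective.
    loss = (-(reward - baseline) * log_prob).sum() - entropy_weight * entropy

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.item()
```

In practice, sampled negatives that are actually true pairs would also need to be filtered out, as the "Handling false negatives" section below discusses.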

Summary

Introduction

Many models learn by contrasting losses on observed positive examples with those on some fictitious negative examples, trying to decrease some score on positive ones while increasing it on negative ones. In noise contrastive estimation for word embeddings, a negative example is formed by replacing a component of a positive pair with a word randomly sampled from the vocabulary, resulting in a fictitious word-context pair that would be unlikely to appear in the dataset. This negative sampling by corruption approach is also used in learning knowledge graph embeddings (Bordes et al., 2013; Lin et al., 2015; Ji et al., 2015; Wang et al., 2014; Trouillon et al., 2016; Yang et al., 2014; Dettmers et al., 2017), order embeddings (Vendrov et al., 2016), caption generation (Dai and Lin, 2017), and other tasks. We demonstrate the efficacy and generality of the proposed method on three different learning tasks: word embeddings (Mikolov et al., 2013), order embeddings (Vendrov et al., 2016), and knowledge graph embeddings (Ji et al., 2015).
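
Negative sampling by corruption, as described above, can be sketched as follows. Function and variable names are illustrative; NCE for word embeddings typically draws the replacement from a unigram noise distribution rather than uniformly, and knowledge-graph methods corrupt either the head or the tail of a triple.

```python
import random

def corrupt_pair(word, context, vocab, rng=random):
    """Replace the context of a positive (word, context) pair with a word
    sampled from the vocabulary, giving a fictitious negative pair."""
    return word, rng.choice(vocab)

def corrupt_triple(head, relation, tail, num_entities, rng=random):
    """Replace the head or the tail of a knowledge-graph triple (h, r, t)
    with a randomly sampled entity, as in Bordes et al. (2013)."""
    if rng.random() < 0.5:
        return rng.randrange(num_entities), relation, tail   # corrupt head
    return head, relation, rng.randrange(num_entities)       # corrupt tail
```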

Background: contrastive learning
Adversarial mixture noise
Learning the generator
Entropy and training stability
Handling false negatives
Variance Reduction
Improving exploration in gθ by leveraging NCE samples
Related Work
Word Embeddings
Order Embeddings Hypernym Prediction
Knowledge Graph Embeddings
Experiments
Training Word Embeddings from scratch
Finetuning Word Embeddings
Hypernym Prediction
Ablation Study and Improving TransD
Limitations
Hard Negative Analysis
Findings
Conclusion
