Abstract

Lexical embeddings can serve as useful representations of words for a variety of NLP tasks, but learning embeddings for phrases is challenging. While a separate embedding can be learned for each word, doing so for every possible phrase is infeasible. We instead construct phrase embeddings by learning how to compose word embeddings, using features that capture phrase structure and context. We propose efficient unsupervised and task-specific learning objectives that scale our model to large datasets. We demonstrate improvements on both language modeling and several phrase semantic similarity tasks across various phrase lengths. We make the implementation of our model and the datasets available for general use.

Highlights

  • We evaluate the perplexity of language models that combine lexical embeddings with phrase embeddings composed by our Feature-rich Compositional Transformation (FCT) model, trained with the language modeling (LM) objective

  • We present FCT, a new composition model for deriving phrase embeddings from word embeddings

  • Compared to existing phrase composition models, FCT is very efficient and can utilize high dimensional word embeddings, which are crucial for semantic similarity tasks

Summary

Introduction

Word embeddings learned by neural language models (Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2013b) have been successfully applied to a range of tasks, including syntax (Collobert and Weston, 2008; Turian et al., 2010; Collobert, 2011) and semantics (Huang et al., 2012; Socher et al., 2013b; Hermann et al., 2014). We propose a new method for compositional semantics that learns to compose word embeddings into phrases. Existing composition methods suffer from two primary disadvantages. First, they have high computational complexity for dense embeddings: O(d2) or O(d3) for composing every two components of dimension d, which restricts these methods to very low-dimensional embeddings (25 or 50). Second, they cannot utilize contextual features of phrases, and still pose scaling challenges.
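To illustrate why feature-based composition avoids the O(d2) cost of matrix-based models, here is a minimal sketch of FCT-style composition: each word embedding is rescaled dimension-wise by a gate that is linear in that word's features, then the rescaled embeddings are summed. The parameter names (`A`, `b`), shapes, and random features are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

d, n_feat = 8, 4  # embedding dimension, number of features (assumed sizes)
rng = np.random.default_rng(0)

# Learned parameters (assumed shapes): features map linearly to a
# per-dimension gate, costing O(d * n_feat) per word rather than O(d^2).
A = rng.normal(size=(n_feat, d))
b = np.zeros(d)

def compose(word_embs, feats):
    """Compose word embeddings into one phrase embedding.

    word_embs: list of (d,) arrays; feats: list of (n_feat,) feature vectors,
    one per word (e.g. encoding POS, position, and context in the real model).
    """
    phrase = np.zeros(d)
    for e, f in zip(word_embs, feats):
        gate = f @ A + b      # per-dimension weights, linear in the features
        phrase += gate * e    # element-wise rescale, then sum over words
    return phrase

# Compose a two-word phrase from toy embeddings and features.
embs = [rng.normal(size=d) for _ in range(2)]
feats = [rng.normal(size=n_feat) for _ in range(2)]
print(compose(embs, feats).shape)  # → (8,)
```

Because the composition is linear in both the embeddings and the features, the model remains cheap enough to use high-dimensional embeddings, which the highlights note are crucial for semantic similarity tasks.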
