Abstract

With the growing application of Machine Learning in daily life, Natural Language Processing (NLP) has emerged as a heavily researched area. Finding applications in tasks ranging from simple Q/A chatbots to fully fledged conversational AI, NLP models are vital. Word and sentence embeddings are among the most common starting points of any NLP task. A word embedding represents a given word in a predefined vector space while maintaining vector relations with similar or dissimilar entities. Several pretrained embeddings, such as Word2Vec, GloVe, and fastText, have been developed for this purpose. These embeddings, generated from corpora of millions of words, are however very large in size, and storing them at floating-point precision also makes downstream evaluation slow. In this paper we present a novel method to convert a continuous embedding into a binary representation, reducing the overall size of the embedding while keeping its semantic and relational knowledge intact. This makes it feasible to port such large embeddings onto devices where space is limited. We also present different approaches suited to different downstream tasks, depending on how much contextual and semantic information they require. Experiments show comparable results in downstream tasks, with a 7 to 15 times reduction in file size and only about a 5% change in evaluation metrics.
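To make the general idea concrete, the minimal sketch below shows one naive way a continuous embedding matrix could be binarized and packed into bits. It uses a random NumPy matrix as a stand-in for real vectors and a simple sign threshold; both are illustrative assumptions, not the method proposed in this paper.

```python
import numpy as np

# Illustrative only: a naive sign-threshold binarization, not the paper's method.
# embeddings: (vocab_size, dim) float32 matrix standing in for pre-trained word vectors.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((10_000, 300)).astype(np.float32)

# Binarize each dimension by its sign: 1 if the component is positive, else 0.
bits = (embeddings > 0).astype(np.uint8)

# Pack 8 binary dimensions into one byte: 300 float32 values (1200 bytes)
# per word shrink to ceil(300 / 8) = 38 bytes.
packed = np.packbits(bits, axis=1)
print(embeddings.nbytes / packed.nbytes)  # in-memory reduction factor of this naive scheme

def hamming_similarity(a, b):
    """Fraction of matching bits between two packed binary vectors,
    a cheap stand-in for cosine similarity on the original floats."""
    diff = np.unpackbits(np.bitwise_xor(a, b))
    return 1.0 - diff.mean()

print(hamming_similarity(packed[0], packed[1]))
```

The actual reduction and accuracy trade-off of such a scheme depends on how the thresholding and downstream similarity are chosen; the sketch only shows the mechanics of replacing floating-point dimensions with packed bits.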

Highlights

  • Natural Language Processing (NLP) is quickly becoming one of the most important branches of Machine Learning, with companies pouring millions into perfecting their NLP engines

  • NLP is the cornerstone on the road to developing a perfect conversational AI

  • In this paper we propose two methods to binarize a pre-trained word embedding while keeping all the semantic information intact

Summary

Introduction

Natural Language Processing (NLP) is quickly becoming one of the most important branches of Machine Learning, with companies pouring millions into perfecting their NLP engines. NLP finds its way into many routine but important tasks in modern life, be it a simple chatbot, a Q/A site, or a document classifier. For most of these tasks, an embedding model or a pre-trained embedding is the natural starting point. A word embedding is a collection of words represented as vectors in a predefined space, and within this space the vectors follow the usual vector laws. Word embeddings are essential for downstream tasks like document classification, query response, and so on. Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2016) are some of the best generic embeddings available, trained on vocabularies of millions of words.
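As a quick illustration of these vector relations, the sketch below uses gensim's downloader API to load a publicly available GloVe model (an assumed choice for the example; any Word2Vec, GloVe, or fastText KeyedVectors model behaves the same way) and queries nearest neighbours along with the classic king - man + woman analogy.

```python
import gensim.downloader as api

# Load a small pre-trained GloVe model (100-dimensional vectors).
# Model name is an assumption for illustration; any pre-trained KeyedVectors works.
vectors = api.load("glove-wiki-gigaword-100")

# Semantically similar words end up close together in the vector space.
print(vectors.most_similar("computer", topn=3))

# The vectors also support arithmetic that captures relations between words:
# king - man + woman is nearest to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```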
