Abstract

Language models have advanced at a phenomenal pace over the past decade [1]. This document provides a short introduction to terminology, word embeddings (also known as low-dimensional representations), and popular large-scale language models (LMs). Word embeddings represent words as numerical vectors and are context-independent, meaning each word has a single representation regardless of its surroundings (e.g., club receives the same vector whether it appears in club sandwich or golf club). Language models can determine the probability of a given sequence of words occurring in a sentence and can provide context to distinguish between words and phrases that sound similar. LMs are context-dependent (e.g., the representation of club differs between club sandwich and golf club) and fall largely into two main classes: autoregressive and autoencoding models. Autoregressive models are pretrained on the classic language modeling task: predict the next token after reading all the previous ones. These models can be fine-tuned to achieve strong results on many tasks, but their most natural application is text generation. A typical example of such models is GPT; others include GPT-2, GPT-3, CTRL, Transformer-XL, Reformer, and XLNet. Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original sentence. They can be fine-tuned to achieve strong results on many tasks, including text generation, but their most natural application is sentence classification or token classification. A typical example of such models is BERT; others include RoBERTa, ALBERT, XLM, XLM-RoBERTa, FlauBERT, and Longformer.
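
To make the two pretraining objectives concrete, the sketch below contrasts them using the Hugging Face transformers pipeline API; the model names ("gpt2", "bert-base-uncased") and the prompts are illustrative assumptions rather than part of the original text. An autoregressive model continues a prompt token by token, while an autoencoding model reconstructs a masked (corrupted) token from context on both sides.

```python
# A minimal sketch of the two pretraining objectives, assuming the Hugging Face
# `transformers` library (with a backend such as PyTorch) is installed.
# Model names and prompts below are illustrative choices, not from the source.
from transformers import pipeline

# Autoregressive (GPT-style): predict the next token given all previous tokens;
# applying this repeatedly yields text generation.
generator = pipeline("text-generation", model="gpt2")
print(generator("The golf club decided to", max_new_tokens=10)[0]["generated_text"])

# Autoencoding (BERT-style): the input is corrupted by masking a token and the
# model reconstructs it using context from both directions.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("He ordered a club [MASK] for lunch."):
    print(candidate["token_str"], round(candidate["score"], 3))
```

In this sketch the masked example also illustrates context dependence: the autoencoding model scores candidates for the masked position differently depending on the surrounding words.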
