Abstract

We propose a novel procedure for training multiple Transformers with tied parameters, which compresses multiple models into one and enables a dynamic choice of the number of encoder and decoder layers at decoding time. When training an encoder-decoder model, the output of the last layer of the N-layer encoder is typically fed to the M-layer decoder, and the output of the last decoder layer is used to compute the loss. Instead, our method computes a single loss consisting of NxM losses, where each loss is computed from the output of one of the M decoder layers connected to one of the N encoder layers. Such a model subsumes NxM models with different numbers of encoder and decoder layers, and can be used for decoding with fewer than the maximum numbers of encoder and decoder layers. Given our flexible tied model, we also address the a priori selection of the number of encoder and decoder layers for faster decoding, and explore recurrent stacking of layers and knowledge distillation for model compression. We present a cost-benefit analysis of applying the proposed approaches to neural machine translation and show that they reduce decoding costs while preserving translation quality.
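
The following is a minimal sketch of the single training objective described above, written as a sum of NxM cross-entropy losses, one per pair of encoder depth n and decoder depth m. It is not the authors' code; all class, argument, and variable names (TiedMultiLoss, encoder_layers, decoder_layers, output_proj, etc.) are assumptions for exposition.

```python
# Sketch: one loss per (encoder depth n, decoder depth m) pair, summed into a
# single training objective. Interfaces of the layer modules are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedMultiLoss(nn.Module):
    def __init__(self, encoder_layers, decoder_layers, output_proj):
        super().__init__()
        self.encoder_layers = encoder_layers  # list of N encoder layers
        self.decoder_layers = decoder_layers  # list of M decoder layers
        self.output_proj = output_proj        # shared projection to the vocabulary

    def forward(self, src, tgt_in, tgt_out):
        total_loss = 0.0
        enc_out = src                          # embedded source sequence
        for enc_layer in self.encoder_layers:  # encoder depths 1..N
            enc_out = enc_layer(enc_out)
            dec_out = tgt_in                   # embedded shifted target sequence
            for dec_layer in self.decoder_layers:      # decoder depths 1..M
                dec_out = dec_layer(dec_out, enc_out)  # cross-attend to encoder depth n
                logits = self.output_proj(dec_out)
                total_loss = total_loss + F.cross_entropy(
                    logits.view(-1, logits.size(-1)), tgt_out.view(-1)
                )
        return total_loss  # single loss consisting of NxM losses
```

In this sketch, decoding with fewer than the maximum number of layers simply corresponds to stopping the two loops early at the chosen encoder and decoder depths.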

Highlights

  • Neural networks for sequence-to-sequence modeling typically consist of an encoder and a decoder coupled via an attention mechanism

  • Recall that we aim at a flexible model and that all the results in Table 1 have been obtained using a single tied-multi model, albeit using different numbers of encoder and decoder layers for decoding

  • We examined the combination of our multi-layer softmaxing approach with another parameter-tying method in neural networks, called recurrent stacking (RS) (Dabre and Fujita, 2019), complemented by sequence-level knowledge distillation (Kim and Rush, 2016), a specific type of knowledge distillation (Hinton et al., 2015); a sketch of recurrent stacking follows below
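
As a rough illustration of recurrent stacking, a single layer's parameters are reused for every stacking pass instead of allocating distinct layers, shrinking the parameter count roughly by the stacking depth. This is a sketch under assumed PyTorch modules, not the authors' implementation.

```python
# Sketch: recurrently stacked encoder sharing one layer's parameters across passes.
import torch.nn as nn

class RecurrentlyStackedEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_passes=6):
        super().__init__()
        # one set of layer parameters, shared across all stacking passes
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_passes = num_passes

    def forward(self, x):
        for _ in range(self.num_passes):  # apply the same layer repeatedly
            x = self.shared_layer(x)
        return x
```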


Summary

Introduction

Neural networks for sequence-to-sequence modeling typically consist of an encoder and a decoder coupled via an attention mechanism. Hyper-parameters, such as the numbers of encoder and decoder layers, can be tuned, for instance, by maximizing an automatic evaluation score on development data. In general, it is highly unlikely (or impossible) that a single optimized model satisfies diverse cost-benefit demands at the same time, and a single optimized model cannot guarantee the best performance for every individual input. An existing solution to these problems is to train multiple models and host them simultaneously; however, this approach is not very practical because it requires a large amount of resources. Moreover, we lack a well-established method for selecting an appropriate model for each individual input prior to decoding.

