Abstract
Probabilistic language models, e.g. those based on recurrent neural networks such as long short-term memory models (LSTMs), often face the problem of finding a high-probability prediction from a sequence of random variables over a set of tokens. This is commonly addressed using a form of greedy decoding such as beam search, where a limited number of highest-likelihood paths (the beam width) of the decoder are kept, and at the end the maximum-likelihood path is chosen. In this work, we construct a quantum algorithm to find the globally optimal parse (i.e. for infinite beam width) with high constant success probability. When the input to the decoder follows a power law with exponent k > 0, our algorithm has runtime R^{nf(R,k)}, where R is the alphabet size and n the input length; here f < 1/2, and f → 0 exponentially fast with increasing k, hence making our algorithm always more than quadratically faster than its classical counterpart. We further modify our procedure to recover a finite beam width variant, which enables an even stronger empirical speedup while still retaining higher accuracy than possible classically. Finally, we apply this quantum beam search decoder to Mozilla’s implementation of Baidu’s DeepSpeech neural net, which we show to exhibit such a power law word rank frequency.
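As context for the beam-width trade-off described above, classical beam search decoding can be sketched in a few lines. This is a toy illustration only, not the paper's algorithm; the token distributions and beam width below are made-up examples:

```python
import math
import heapq

def beam_search_decode(distributions, beam_width):
    """Greedy beam-search decoding over a sequence of token distributions.

    distributions: list of dicts mapping token -> probability, one per step.
    At each step, only the `beam_width` highest-likelihood partial paths are
    kept; the best surviving complete path and its log-probability are returned.
    """
    beam = [(0.0, [])]  # each entry is (log-probability, path); start empty
    for dist in distributions:
        candidates = []
        for log_prob, path in beam:
            for token, prob in dist.items():
                if prob > 0:
                    candidates.append((log_prob + math.log(prob), path + [token]))
        # Prune: keep only the beam_width best partial paths.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])

# Toy example: three steps over the alphabet {a, b}.
steps = [{"a": 0.6, "b": 0.4}, {"a": 0.1, "b": 0.9}, {"a": 0.7, "b": 0.3}]
log_p, path = beam_search_decode(steps, beam_width=2)
print(path)  # → ['a', 'b', 'a']
```

With a finite beam width the decoder can discard the prefix of the true maximum-likelihood path early; the quantum decoder in the paper avoids this by targeting the infinite-beam-width optimum directly.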
Highlights
A recurring task in the context of parsing and neural sequence-to-sequence models—such as machine translation (Sutskever et al. 2011; Sutskever et al. 2014), natural language processing (Schmidhuber 2014) and generative models (Graves 2013)—is to find an optimal path of tokens from a sequential list of probability distributions.
Our novel algorithmic contribution is to analyse a recently developed quantum maximum finding algorithm (Apeldoorn et al. 2017) and its expected runtime when provided with a biased quantum sampler, which we develop for formal grammars, under the premise that at each step the input tokens follow a power-law distribution; for a probabilistic sequence obtained from Mozilla's DeepSpeech, the quantum search decoder runs a power of ≈ 4–5 faster than possible classically (Fig. 2).
We analyse the runtime of Algorithm 2 for various choices of beam width numerically, and assess its performance on a concrete example—Mozilla's DeepSpeech implementation, a speech-to-text long short-term memory (LSTM) network which we show to follow a power-law token distribution at each output frame.
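The power-law claim for a per-frame output distribution can be checked by a rank-frequency fit in log-log space. A minimal sketch using a synthetic frame (the 28-token alphabet size and exact Zipf probabilities are illustrative assumptions, not DeepSpeech data):

```python
import numpy as np

def fit_power_law_exponent(probs):
    """Estimate the exponent k of a power-law rank-frequency distribution.

    probs: 1-D array of token probabilities for one output frame.
    Sorts probabilities into rank order and fits log p_r ≈ -k log r + c
    by least squares, returning the estimated k.
    """
    ranked = np.sort(np.asarray(probs, dtype=float))[::-1]
    ranked = ranked[ranked > 0]  # drop zero-probability tokens
    ranks = np.arange(1, len(ranked) + 1)
    # Linear regression in log-log space: the slope equals -k.
    slope, _ = np.polyfit(np.log(ranks), np.log(ranked), 1)
    return -slope

# Synthetic frame whose probabilities follow an exact power law p_r ∝ r^{-2}.
r = np.arange(1, 29)   # 28 tokens, roughly an English character alphabet
p = r ** -2.0
p /= p.sum()
print(round(fit_power_law_exponent(p), 3))  # → 2.0
```

On real decoder output the fit would be applied frame by frame; a consistently large estimated exponent k is what drives the f → 0 behaviour in the runtime bound.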
Summary
A recurring task in the context of parsing and neural sequence-to-sequence models—such as machine translation (Sutskever et al. 2011; Sutskever et al. 2014), natural language processing (Schmidhuber 2014) and generative models (Graves 2013)—is to find an optimal path of tokens (e.g. words or letters) from a sequential list of probability distributions. Such a distribution can for instance be produced at the output layer of a recurrent neural network, e.g. a long short-term memory model (LSTM). A related task is found in transition-based parsing of formal languages, such as context-free grammars (Hopcroft et al. 2001; Zhang and Clark 2008; Zhang and Nivre 2011; Zhu et al. 2015; Dyer et al. 2015). In this model, an input string is processed token by token, and a heuristic prediction determines the next parsing action at each step.
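Classically, the globally optimal parse corresponds to an exhaustive search over all R^n token paths, keeping only those accepted by the language. A brute-force sketch of that optimum (the balanced-parentheses language and the step distributions are toy assumptions, chosen only to make the search concrete):

```python
import itertools
import math

def optimal_parse(distributions, is_valid):
    """Exhaustive maximum-likelihood decoding (infinite beam width).

    Enumerates all R^n token paths, keeps those accepted by the predicate
    `is_valid` (e.g. membership in a formal language), and returns the
    accepted path of highest probability with that probability.
    Exponential time classically; this optimum is what the quantum
    search decoder targets.
    """
    tokens = list(distributions[0])
    best = (-math.inf, None)
    for path in itertools.product(tokens, repeat=len(distributions)):
        prob = math.prod(d[t] for d, t in zip(distributions, path))
        if prob > best[0] and is_valid(path):
            best = (prob, path)
    return best

# Toy formal language: strings of matched parentheses over {"(", ")"}.
def balanced(path):
    depth = 0
    for t in path:
        depth += 1 if t == "(" else -1
        if depth < 0:
            return False
    return depth == 0

steps = [{"(": 0.9, ")": 0.1}] * 4
p, path = optimal_parse(steps, balanced)
print(path)  # → ('(', '(', ')', ')')
```

The per-step argmax string "((((" is not in the language here, which is exactly why greedy or narrow-beam decoding can miss the optimal grammatical parse.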