Fast training of a transformer for global multi-horizon time series forecasting on tensor processing units

J.-Luis García-Nava,Victor M Tellez,Felix Calderon,Juan J Flores

doi:10.1007/s11227-022-05009-x

Abstract

Time Series Forecasting (TSF) is essential to key domains, and the Transformer neural network has advanced the state-of-the-art on global, multi-horizon TSF benchmarks. The quadratic time and memory complexity of the Vanilla Transformer (VT) hinders its application to Big Data environments; therefore, multiple efficient variants of the VT that lower complexity via sparse self-attention have been proposed. However, less complex algorithms do not directly produce faster executions, and machine learning models for Big Data are typically trained on accelerators designed for dense-matrix computation that render slower performance with sparse matrices. To better compare the accuracy-speed trade-off of the VT and its variants, it is essential to test them on such accelerators. We implemented a cloud-based VT on Tensor Processing Units to address this task. Experiments on large-scale datasets show that our Transformer achieves good predictive performance when compared to state-of-the-art models while reducing training times from hours to under 2 min.

Full Text