Abstract

The Transformer, whose self-attention mechanism enables fully connected contextual encoding over the input tokens, has achieved outstanding performance on various NLP tasks, but its complexity is quadratic in the input sequence length. Long-range contexts are therefore often processed by the Transformer in chunks with a sliding window to avoid exhausting GPU memory. However, how to achieve strong downstream performance while modeling sequences that are as long as possible under limited GPU resources remains an open problem. To address this issue, we propose a new framework that uses a hybrid-attention Transformer to capture long-range contextual features. More specifically, we combine three types of attention: sliding-window local attention, clustering-based long-range attention, and specific global attention. We compare our model with mainstream efficient Transformer variants on the document classification task using the public IMDB and CAIL-Long datasets. The experimental results show that the proposed approach outperforms state-of-the-art models on the adopted datasets, verifying the effectiveness of our model for long-range document classification.
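The sketch below is a minimal illustration, not the authors' implementation, of how the three attention patterns named in the abstract could be merged into a single sparse attention mask. The window size, cluster assignments, and choice of global positions are illustrative assumptions.

```python
# Minimal sketch of a hybrid attention mask combining sliding-window local
# attention, clustering-based long-range attention, and global attention.
# All parameter values are illustrative assumptions, not the paper's settings.
import numpy as np

def hybrid_attention_mask(seq_len, window=4, cluster_ids=None, global_positions=()):
    """Return a (seq_len, seq_len) boolean mask; True = attention allowed."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # 1) Sliding-window local attention: each token attends to neighbours
    #    within +/- `window` positions.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # 2) Clustering-based long-range attention: tokens assigned to the same
    #    cluster (e.g. by k-means over token representations) attend to
    #    each other regardless of distance.
    if cluster_ids is not None:
        cluster_ids = np.asarray(cluster_ids)
        mask |= cluster_ids[:, None] == cluster_ids[None, :]

    # 3) Global attention: designated tokens (e.g. a [CLS] token) attend to
    #    all positions and are attended to by all positions.
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask

# Example: 16 tokens, 4 hypothetical clusters, token 0 acting as a global token.
m = hybrid_attention_mask(16, window=2,
                          cluster_ids=np.arange(16) % 4,
                          global_positions=(0,))
print(m.sum(), "of", m.size, "query-key pairs are attended")
```

Because each row of the mask stays sparse (local window plus one cluster plus a few global tokens), the attention cost grows roughly linearly with sequence length rather than quadratically, which is the premise behind this family of efficient Transformers.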
