Abstract

Compared with the vanilla transformer, the window-based transformer offers a better trade-off between accuracy and efficiency. Although window-based transformers have made great progress, their long-range modeling capability is limited by the size of the local window and the window connection scheme. To address this problem, we propose a novel Window Token Transformer (WTT). The core mechanism of WTT is the addition of a window token that summarizes the information of each local window. Each window token interacts spatially with the tokens in its window to enable long-range modeling; we refer to this interaction as Window Token Attention. To preserve the hierarchical design of the window-based transformer, we design a Feature Inheritance Module (FIM) at each stage of WTT that delivers the local window information from the previous stage to the window tokens of the next stage. In addition, we design a Global–Local Feedforward Network (GLFFN), which enhances the local awareness of the network while preserving its global awareness. Extensive experiments show that WTT achieves competitive results with few parameters on image classification and downstream tasks.
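
Below is a minimal PyTorch sketch of the window-token mechanism the abstract describes, included only to make the idea concrete. It is not the authors' implementation: the module name WindowTokenAttention, the use of nn.MultiheadAttention, and the choice to exchange long-range information through a second attention step over the window tokens are all illustrative assumptions.

import torch
import torch.nn as nn


class WindowTokenAttention(nn.Module):
    """Sketch (not the paper's code): per-window self-attention with a
    learnable window token prepended to each window, followed by attention
    among the window tokens so that summarized window information can
    propagate across the whole feature map."""

    def __init__(self, dim, window_size, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.window_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C) feature map; H and W assumed divisible by window_size.
        B, H, W, C = x.shape
        ws = self.window_size
        # Partition into non-overlapping windows: (B * nW, ws * ws, C).
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        nW = x.shape[0] // B
        # Prepend one window token per window, then attend within the window.
        tok = self.window_token.expand(x.shape[0], -1, -1)
        x = torch.cat([tok, x], dim=1)
        x, _ = self.local_attn(x, x, x)
        # Let the window tokens of all windows attend to one another, giving
        # each window a long-range view of the rest of the image (assumption:
        # the paper may realize cross-window interaction differently).
        tok, patches = x[:, :1], x[:, 1:]
        tok = tok.reshape(B, nW, C)
        tok, _ = self.global_attn(tok, tok, tok)
        tok = tok.reshape(B * nW, 1, C)
        # Broadcast the refined window summary back to the patch tokens.
        patches = patches + tok
        # Reverse the window partition back to (B, H, W, C).
        patches = patches.reshape(B, H // ws, W // ws, ws, ws, C)
        return patches.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

Under these assumptions, a module with dim=96 and window_size=7 maps a (B, 56, 56, 96) feature map to the same shape, with every patch token receiving a summary of every other window through the window-token exchange rather than through enlarged windows.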
