Token Group Research Articles

In video recognition, video transformers have demonstrated remarkable performance, albeit at significant computational cost. This paper introduces TSNet, an approach for dynamically selecting informative tokens from a given video sample. A lightweight prediction module assigns an importance score to each token in the video, and only the top-scoring tokens are used for self-attention computation. We apply the Gumbel-softmax technique to sample from the output of the prediction module, which enables end-to-end optimization of that module. We extend the method to hierarchical vision transformers rather than single-scale vision transformers. A simple linear module projects the pruned tokens, and the projected result is concatenated with the output of the self-attention network, so that the number of tokens stays the same while interactions with the selected tokens are still captured. Since feedforward networks (FFNs) account for a significant share of the computation, we also apply a linear projection to the pruned tokens to accelerate the model, while the existing FFN layer processes the selected tokens. Finally, to keep the structure of the output unchanged, the two groups of tokens are reassembled according to their spatial positions in the original feature map. The experiments focus primarily on the Kinetics-400 dataset using UniFormer, a hierarchical video transformer backbone that incorporates convolution in its self-attention block. Our model achieves results comparable to the original model while reducing computation by over 13%. Notably, by hierarchically pruning 70% of the input tokens, our approach reduces FLOPs by 55.5% while confining the drop in accuracy to 2%.
Additional tests with other transformers, such as the Video Swin Transformer, indicate that the approach is broadly applicable and adaptable and show its potential on video recognition benchmarks. With our token sparsification framework, video vision transformers can trade a slight reduction in accuracy for a substantial gain in computational speed.
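
The selection-and-reassembly pipeline described in the TSNet abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`gumbel_softmax`, `sparsify_tokens`), the `attend` and `project` callables, and the `keep_ratio` parameter are all hypothetical, and the toy callables in the usage lines merely stand in for a real self-attention block and linear projection.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed one-hot sample from unnormalised logits (hypothetical helper)."""
    noisy = []
    for logit in logits:
        # Clamp U away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def sparsify_tokens(tokens, scores, keep_ratio, attend, project):
    """Route top-scoring tokens through `attend`, the rest through `project`.

    `attend` maps a list of tokens to a list of the same length (stand-in for
    self-attention); `project` maps one token to one token (stand-in for the
    linear projection of pruned tokens). Both are assumptions of this sketch.
    """
    n = len(tokens)
    k = max(1, round(keep_ratio * n))
    # Indices of the k highest scores, restored to spatial order.
    by_score = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = sorted(by_score[:k])
    kept_set = set(kept)

    attended = attend([tokens[i] for i in kept])

    # Reassemble both groups at their original positions.
    out = [None] * n
    for i, tok in zip(kept, attended):
        out[i] = tok
    for i in range(n):
        if i not in kept_set:
            out[i] = project(tokens[i])
    return out, kept


# Toy usage: four one-dimensional tokens, keep the top half.
tokens = [[1.0], [2.0], [3.0], [4.0]]
scores = [0.1, 0.9, 0.3, 0.8]
out, kept = sparsify_tokens(
    tokens, scores, 0.5,
    attend=lambda ts: [[10 * x for x in t] for t in ts],
    project=lambda t: [-x for x in t],
)
print(kept)  # [1, 3]
print(out)   # [[-1.0], [20.0], [-3.0], [40.0]]
```

In this sketch, `gumbel_softmax` would be applied during training to a per-token pair of keep/drop logits so that the discrete choice stays differentiable; that two-way layout is an assumption, since the abstract states only that Gumbel-softmax is used to sample from the prediction module's output.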

Fine-grained recognition classifies images into hundreds of subcategory labels by locating discriminative regions (e.g., telling a Cape May warbler from a Magnolia warbler). Because region localization is highly complex and non-differentiable in traditional backbone architectures, most existing approaches use multi-level reinforcement learning to distinguish the similar appearances of sub-categories. These methods exploit incomplete information, using only the intra-class informative regions in one image or the inter-class and intra-class relationships in image pairs, and therefore tend to locate overlapping regions. Since inter-class correlations and a backbone with complete contextual semantic information play important roles in distinguishing fine-grained classes, we propose a novel transformer with a collaborative token mining (TCTM) scheme that fully exploits the relationships between inter-class and intra-class regions. The proposed TCTM scheme, built on a new transformer backbone, consists of two modules that collaboratively explore spatially aware tokens: the Pyramid Tokens Multiplication (PTM) module, which exploits the integrated multi-stage inter-class and intra-class correlations from the new transformer architecture, and the Tokens Proposals Generation (TPG) module, which captures two groups of top-four discriminative tokens. The two PTMs extract contrastive tokens for each image and learn to rank these tokens, assuming that tokens from the same class and the same module should have smaller distances. The TPGs further sort and update the candidate tokens from the extracted attention tokens by ranking their probabilities against the ground-truth subcategory labels. Through the collaboration between the PTM and TPG, the TCTM scheme takes the integrated correlations into account and mines the discriminative tokens for final fine-grained classification.
Extensive experiments on four popular benchmarks show that the proposed TCTM outperforms state-of-the-art methods by a large margin.
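
The two ranking ideas in the TCTM abstract, selecting the top-four tokens by their probability for the ground-truth label and preferring smaller distances between same-class tokens, can be illustrated as follows. This is a hedged sketch only: the abstract does not give the loss or the selection rule in detail, so the triplet-style hinge loss is a swapped-in stand-in, and the names `top_k_tokens_by_label`, `ranking_hinge_loss`, and `margin` are hypothetical.

```python
import math


def top_k_tokens_by_label(token_probs, label, k=4):
    """Indices of the k tokens with the highest probability for `label`.

    `token_probs[i][c]` is assumed to be the predicted probability of
    class `c` from token `i` (an assumption of this sketch).
    """
    ranked = sorted(
        range(len(token_probs)),
        key=lambda i: token_probs[i][label],
        reverse=True,
    )
    return ranked[:k]


def ranking_hinge_loss(anchor, positive, negative, margin=1.0):
    """Triplet-style hinge: same-class tokens should be closer than others.

    Zero when the positive is at least `margin` closer to the anchor
    than the negative is (Euclidean distance).
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy usage: six tokens, two classes, ground-truth label 1.
probs = [[0.8, 0.2], [0.1, 0.9], [0.5, 0.5],
         [0.3, 0.7], [0.9, 0.1], [0.4, 0.6]]
print(top_k_tokens_by_label(probs, label=1))  # [1, 3, 5, 2]
print(ranking_hinge_loss([0, 0], [3, 4], [6, 8]))  # 0.0
print(ranking_hinge_loss([0, 0], [6, 8], [3, 4]))  # 6.0
```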

Articles published on Token Group

Fusion Attention for Action Recognition: Integrating Sparse-Dense and Global Attention for Video Action Recognition

Selective Information Flow for Transformer Tracking

Semantic-aware Message Broadcasting for Efficient Unsupervised Domain Adaptation

Proposal of a Token-Based Node Selection Mechanism for Node Distribution of Mobility IoT Blockchain Nodes

TSNet: Token Sparsification for Efficient Video Transformer

An Integrated Transformer with Collaborative Tokens Mining for Fine-Grained Recognition

The Multi-Hot Representation-Based Language Model to Maintain Morpheme Units

NFTs and copyright: challenges and opportunities

LET-Decoder: A WFST-Based Lazy-Evaluation Token-Group Decoder With Exact Lattice Generation

The Concept of "Ethnicity" in Modern British Dictionaries (Based on the OALD Dictionary)

Intellectual Innovations in Georgia (11th-9th Centuries BC)

The Influence of Incidental Tokenism on Private Evaluations of Stereotype-Typifying Products

Color Names "White" and "Black" in Descriptions of Person in the Kazakh Language vs. English and Russian

Developmental network structure and support: gendered consequences for work–family strain and work–parenting strain in the Australian mining industry

Perceptually relevant grouping of image tokens on the basis of constraint propagation from local binary patterns

Extreme Violence and the Media: Challenges of Reporting Terrorism in Nigeria

IoT Service Access Control Platform Using CapSG

Primary and secondary effects of processing instruction on Spanish clitic pronouns

A Fuzzy Logic Approach to Wrapping PDF Documents

First Passage Time Computation in Tagged GSPNs with Queue Places

