What drives attention sinks? A study of massive activations and rotational positional encoding in large vision–language models
- Conference Article
- 10.21437/interspeech.2019-2225
- Sep 15, 2019
We explore deep autoregressive Transformer models in language modeling for speech recognition. We focus on two aspects. First, we revisit Transformer model configurations specifically for language modeling. We show that well-configured Transformer models outperform our baseline models based on a shallow stack of LSTM recurrent neural network layers. We carry out experiments on the open-source LibriSpeech 960hr task, for both 200K-vocabulary word-level and 10K byte-pair-encoding subword-level language modeling. We apply our word-level models to conventional hybrid speech recognition by lattice rescoring, and the subword-level models to attention-based encoder-decoder models by shallow fusion. Second, we show that deep Transformer language models do not require positional encoding. Positional encoding is an essential augmentation for the self-attention mechanism, which is invariant to sequence ordering. However, in an autoregressive setup, as is the case for language modeling, the amount of information increases along the position dimension, which is a positional signal in its own right. An analysis of attention weights shows that deep autoregressive self-attention models can automatically make use of such positional information. We find that removing the positional encoding even slightly improves the performance of these models.
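The abstract's claim that causal ordering itself carries a positional signal can be illustrated with a minimal NumPy sketch (not the authors' code): under a causal mask, position t attends to exactly t + 1 tokens, so the support of each attention row grows with position even though no positional encoding is added.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head self-attention with a causal mask and no positional encoding."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # (T, T) similarities
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf                              # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
_, w = causal_attention(x, x, x)

# Position t attends to exactly t + 1 tokens: the growing support is itself
# a positional signal the model can exploit.
for t in range(T):
    print(t, int((w[t] > 0).sum()))
```

The dimensions and random inputs are arbitrary; the point is only that the nonzero pattern of the attention matrix already distinguishes positions.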
- Research Article
- 10.1609/aaai.v35i14.17485
- May 18, 2021
- Proceedings of the AAAI Conference on Artificial Intelligence
Transformers are powerful for sequence modeling. Nearly all state-of-the-art language models and pre-trained language models are based on the Transformer architecture. However, it distinguishes sequential tokens only with the token position index. We hypothesize that better contextual representations can be generated from the Transformer with richer positional information. To verify this, we propose a segment-aware Transformer (Segatron), by replacing the original token position encoding with a combined position encoding of paragraph, sentence, and token. We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model with memory extension and relative position encoding. We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset. We further investigate the pre-training masked language modeling task with Segatron. Experimental results show that BERT pre-trained with Segatron (SegaBERT) can outperform BERT with vanilla Transformer on various NLP tasks, and outperforms RoBERTa on zero-shot sentence representation learning. Our code is available on GitHub.
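One plausible reading of Segatron's combined position encoding, sketched here with hypothetical randomly initialized tables (in the actual model these are learned jointly during training), is a sum of paragraph-, sentence-, and token-position embeddings in place of the single token-position embedding:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
# Hypothetical embedding tables; sizes and values are illustrative only.
para_emb = rng.normal(size=(10, d))    # paragraph position
sent_emb = rng.normal(size=(50, d))    # sentence position
tok_emb = rng.normal(size=(512, d))    # token position

def segment_aware_position(para_idx, sent_idx, tok_idx):
    """Combined position encoding: paragraph + sentence + token embeddings,
    replacing the single token-position embedding of a vanilla Transformer."""
    return para_emb[para_idx] + sent_emb[sent_idx] + tok_emb[tok_idx]

pe = segment_aware_position(para_idx=0, sent_idx=2, tok_idx=7)
print(pe.shape)
```

Whether the three components are summed or combined differently is an assumption of this sketch, not a detail stated in the abstract.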
- Video Transcripts
- 10.48448/gnbq-xb75
- Oct 21, 2021
In order to preserve word-order information in a non-autoregressive setting, transformer architectures tend to include positional knowledge, for instance by adding positional encodings to token embeddings. Several modifications have been proposed over the sinusoidal positional encodings used in the original transformer architecture; these include, for instance, separating position encodings and token embeddings, or directly modifying attention weights based on the distance between word pairs. We first show that, surprisingly, while these modifications tend to improve monolingual language models, none of them result in better multilingual language models. We then answer why that is: sinusoidal encodings were explicitly designed to facilitate compositionality by allowing linear projections over arbitrary time steps. Higher variance in multilingual training distributions requires higher compression, in which case compositionality becomes indispensable. Learned absolute positional encodings (e.g., in mBERT) tend to approximate sinusoidal embeddings in multilingual settings, but more complex positional encoding architectures lack the inductive bias to effectively learn compositionality and cross-lingual alignment. In other words, while sinusoidal positional encodings were originally designed for monolingual applications, they are particularly useful in multilingual language models.
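The "linear projections over arbitrary time steps" property that this abstract invokes can be checked numerically: for sinusoidal encodings, PE(pos + k) is a fixed block-diagonal rotation of PE(pos). A small NumPy verification (the dimension d and offset k are arbitrary choices here):

```python
import numpy as np

def sinusoidal_pe(pos, d):
    """Interleaved (sin, cos) pairs, one pair per frequency."""
    i = np.arange(d // 2)
    freq = 1.0 / (10000.0 ** (2 * i / d))
    angle = pos * freq
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1).reshape(-1)

d, k = 8, 3  # embedding size and position offset
freq = 1.0 / (10000.0 ** (2 * np.arange(d // 2) / d))

# Block-diagonal matrix R with one 2x2 rotation per frequency satisfies
# PE(pos + k) = R @ PE(pos) for every pos.
R = np.zeros((d, d))
for j, w in enumerate(freq):
    c, s = np.cos(k * w), np.sin(k * w)
    R[2*j:2*j+2, 2*j:2*j+2] = [[c, s], [-s, c]]

for pos in [0, 5, 11]:
    assert np.allclose(R @ sinusoidal_pe(pos, d), sinusoidal_pe(pos + k, d))
print("shift property holds")
```

Because R depends only on the offset k and not on pos, a model can in principle learn one linear map per relative distance, which is the compositionality argument made above.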
- Conference Article
- 10.18653/v1/2021.emnlp-main.59
- Jan 1, 2021
- Conference Article
- 10.1109/nnice58320.2023.10105716
- Feb 24, 2023
Recently, the self-attention mechanism (Transformer) has shown its advantages in various natural language processing (NLP) tasks. Since positional information is crucial to NLP tasks, the positional encoding has become a critical factor in improving the performance of the Transformer. In this paper, we present a simple but effective complex-valued relative positional encoding (CRPE) method. Specifically, we map the query and key vectors to the complex domain based on their positions. Hence, the attention weights will directly contain the relative positional information by the dot product between the complex-valued query and key vectors. To demonstrate the effectiveness of our method, we use four typical NLP tasks: named entity recognition, text classification, machine translation, and language modeling. The datasets of these tasks comprise texts of varying lengths. In the experiments, our method outperforms the baseline positional encodings across all datasets. The results show that our method is more effective for long and short texts while containing fewer parameters.
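The core CRPE idea can be sketched in NumPy (the paper's exact parameterization may differ; theta here is a hypothetical per-dimension frequency vector): rotating query and key by their positions in the complex plane makes the Hermitian dot product depend only on the relative offset.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
theta = rng.uniform(0.0, 1.0, size=d)  # hypothetical per-dimension frequencies

def complexify(x, pos):
    """Rotate each dimension of x in the complex plane by pos * theta."""
    return x * np.exp(1j * pos * theta)

q = rng.normal(size=d)
k = rng.normal(size=d)

def score(pos_q, pos_k):
    """Hermitian dot product between position-rotated query and key; the
    phases cancel down to the relative offset pos_q - pos_k."""
    return np.vdot(complexify(k, pos_k), complexify(q, pos_q))

# Scores at equal relative offsets coincide:
print(np.allclose(score(3, 1), score(10, 8)))  # prints True (offset 2 in both cases)
```

Algebraically, the score is a sum of terms q_j k_j exp(i(pos_q - pos_k) theta_j), so absolute positions never enter; note that `np.vdot` conjugates its first argument, which is what realizes the Hermitian product.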
- Preprint Article
- 10.31219/osf.io/p7nz9
- Jul 11, 2024
Language models have reached remarkable performance levels in natural language processing tasks, yet they continue to face challenges related to inference hallucination, which compromises the factual accuracy and reliability of generated content. The introduction of contextual positional double encoding represents a novel and significant advancement, providing a dual mechanism that simultaneously captures static positional information and dynamic contextual relationships among tokens. Modifications to the GPT-Neo architecture incorporated this encoding method, resulting in a model that demonstrated enhanced contextual awareness and reduced hallucination frequency. Comprehensive evaluations showed improvements in perplexity, BLEU scores, and qualitative assessments of text coherence and factual accuracy, demonstrating the method's effectiveness. The results indicate that the enhanced GPT-Neo model produces more reliable and contextually accurate outputs, addressing critical challenges in natural language processing and paving the way for more dependable AI-driven text generation systems. The findings highlight the potential of contextual enhancements to substantially improve the robustness and accuracy of language models.
- Research Article
- 10.1109/taslp.2021.3082299
- Jan 1, 2021
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because the decoder predicts text tokens (such as characters or words) in an autoregressive manner, it is difficult for an AED model to predict all tokens in parallel, which makes inference relatively slow. We believe that because the encoder already captures the whole speech utterance, which implicitly contains the token-level relationships, we can predict a token without explicit autoregressive language modeling. When the prediction of a token does not rely on other tokens, parallel prediction of all tokens in the sequence becomes realizable. Based on this idea, we propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS); all three modules are based on basic attention blocks. The encoder extracts high-level representations from the speech. The PDS uses positional encodings corresponding to tokens to convert the acoustic representations into token-level representations. The decoder further captures token-level relationships with the self-attention mechanism. Finally, the probability distribution over the vocabulary is computed for each token position, so speech recognition is reformulated as a position-wise classification problem. Further, we propose a cross-modal transfer-learning method that refines semantics from the large-scale pre-trained language model BERT to improve performance.
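The position-dependent summarizer can be sketched as one cross-attention read per output token position, with positional encodings acting as queries over the acoustic encoder outputs, all positions computed at once. This is a NumPy sketch of the mechanism as described, with hypothetical dimensions and random stand-ins for the learned encodings:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
T_acoustic, L_tokens, d = 40, 6, 8

enc = rng.normal(size=(T_acoustic, d))        # acoustic encoder outputs
pos_queries = rng.normal(size=(L_tokens, d))  # positional encodings as queries (hypothetical values)

# One cross-attention read per output token position, all computed at once:
# there is no autoregressive dependency between positions.
attn = softmax(pos_queries @ enc.T / np.sqrt(d))
token_reprs = attn @ enc
print(token_reprs.shape)  # (6, 8)
```

Each row of `token_reprs` would then be classified independently over the vocabulary, which is the position-wise classification view the abstract describes.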
- Research Article
- 10.1609/aaai.v39i23.34637
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
The capability of In-Context Learning (ICL) is crucial for large language models to generalize across a wide range of tasks. By utilizing prompts, these models can accurately predict outcomes for previously unseen tasks without necessitating retraining. However, this generalization ability does not extend to the length of the inputs; the effectiveness of ICL likely diminishes with excessively long inputs, resulting in errors in the generated text. To investigate this issue, we propose a study using a dataset of In-Context functions to understand the operational mechanisms of Transformer models in ICL and length generalization. We generated data using regression and Boolean functions and employed meta-learning techniques to endow the model with ICL capabilities. Our experimental results indicate that position encodings can significantly mitigate length generalization issues, with the most effective encoding extending the maximum input length to over eight times that of the original training length. However, further analysis revealed that while position encoding enhances length generalization, it compromises the model's inherent capabilities, such as its ability to generalize across different data types. Overall, our research illustrates that position encodings have a pronounced positive effect on length generalization, though it necessitates a careful trade-off with data generalization performance.
- Preprint Article
- 10.31219/osf.io/exjqb
- May 31, 2024
In natural language processing, maintaining factual accuracy and minimizing hallucinations in text generation remain significant challenges. Contextual Position Encoding (CPE) presents a novel approach by dynamically encoding positional information based on the context of each token, significantly enhancing the model's ability to generate accurate and coherent text. The integration of CPE into the Mistral Large model resulted in marked improvements in precision, recall, and F1-score, demonstrating superior performance over traditional positional encoding methods. Furthermore, the enhanced model architecture effectively reduced hallucination rates, increasing the reliability of the generated outputs. Comparative analysis with baseline models such as GPT-3 and BERT confirmed the efficacy of CPE, highlighting its potential to influence future developments in LLM architecture. The results underscore the importance of advanced positional encoding techniques in improving the performance and applicability of large language models across various domains requiring high factual accuracy.
- Research Article
- 10.1609/aaai.v36i11.21587
- Jun 28, 2022
- Proceedings of the AAAI Conference on Artificial Intelligence
NLP applications for code-mixed (CM) or mixed-lingual text have gained significant momentum recently, the main reason being the prevalence of language mixing in social-media communication in multilingual societies like India, Mexico, Europe, and parts of the USA. Word embeddings are basic building blocks of any NLP system today, yet word embeddings for CM languages remain an unexplored territory. The major bottleneck for CM word embeddings is switching points, where the language switches. These locations lack context, and statistical systems fail to model this phenomenon due to the high variance in the seen examples. In this paper we present our initial observations on applying switching-point-based positional encoding techniques to CM language, specifically Hinglish (Hindi-English). Results are only marginally better than SOTA, but it is evident that positional encoding could be an effective way to train position-sensitive language models for CM text.
- Conference Article
- 10.1115/detc2023-116789
- Aug 20, 2023
Aspect-based sentiment analysis (ABSA) enables a systematic identification of user opinions on particular aspects, thus enhancing the idea creation process in the initial stages of product/service design. Attention-based large language models (LLMs) like BERT and T5 have proven powerful in ABSA tasks. Yet, several key limitations remain, both regarding the ABSA task and the capabilities of attention-based models. First, existing research mainly focuses on relatively simple ABSA tasks such as aspect-based sentiment analysis, while the task of extracting aspect, opinion, and sentiment in a unified model remains largely unaddressed. Second, current ABSA tasks overlook implicit opinions and sentiments. Third, most attention-based LLMs like BERT use position encoding in a linearly projected manner or through split-position relations in word-distance schemes, which could lead to relation biases during the training process. This article addresses these gaps by (1) creating a new annotated dataset with five types of labels, including aspect, category, opinion, sentiment, and implicit indicator (ACOSI), (2) developing a unified model capable of extracting all five types of labels simultaneously in a generative manner, and (3) designing a new position encoding method in the attention-based model. The numerical experiments conducted on a manually labeled dataset scraped from three major e-Commerce retail stores for apparel and footwear products demonstrate the performance, scalability, and potential of the framework developed. The article concludes with recommendations for future research on automated need finding and sentiment analysis for user-centered design.
- Research Article
- 10.1186/s13321-025-00959-9
- Feb 5, 2025
- Journal of Cheminformatics
Recently, advancements in cheminformatics, such as representation learning for chemical structures, deep learning (DL) for property prediction, data-driven discovery, and optimization of chemical data handling, have led to increased demand for handling chemical simplified molecular-input line-entry system (SMILES) data, particularly in text-analysis tasks. These advancements have driven the need to optimize components like positional encodings and positional embeddings (PEs) in the transformer model to better capture the sequential and contextual information embedded in molecular representations. SMILES data represent complex relationships among atoms or elements, rendering them critical for various learning tasks within the field of cheminformatics. This study addresses the critical challenge of encoding complex relationships among atoms in SMILES strings, exploring various PEs within the transformer-based framework to increase the accuracy and generalization of molecular property predictions. The success of transformer-based models, such as the bidirectional encoder representations from transformers (BERT) models, in natural language processing tasks has sparked growing interest in the domain of cheminformatics. However, the performance of these models during pretraining and fine-tuning is significantly influenced by positional information such as PEs, which help in understanding the intricate relationships within sequences. Integrating position information within transformer architectures has emerged as a promising approach; this encoding mechanism provides essential supervision for modeling dependencies among elements situated at different positions within a given sequence. In this study, we first conduct pretraining experiments using various PEs to explore diverse methodologies for incorporating positional information into the BERT model for chemical text analysis using SMILES strings.
Next, for each PE, we fine-tune the best-performing BERT (masked language modeling) model on downstream tasks for molecular property prediction. Here, we use two molecular representations, SMILES and DeepSMILES, to comprehensively assess the potential and limitations of the PEs in zero-shot learning analysis, demonstrating the model's proficiency in predicting properties of unseen molecular representations in the context of newly proposed and existing datasets.

Scientific contribution: This study explores the unexplored potential of PEs using the BERT model for molecular property prediction. The study involved pretraining and fine-tuning the BERT model on various datasets related to COVID-19, bioassay data, and other molecular and biological properties, using SMILES and DeepSMILES representations. It details the pretraining architecture, the fine-tuning datasets, and the performance of the BERT model with different PEs, and it explores zero-shot learning analysis and the model's performance on various classification and regression tasks. Newly proposed datasets from different domains were introduced during fine-tuning, in addition to existing and commonly used datasets. The study highlights the robustness of the BERT model in predicting chemical properties and its potential applications in cheminformatics and bioinformatics.
- Conference Article
- 10.1109/icdar.2019.00208
- Sep 1, 2019
Encoder-decoder models have become an effective approach for sequence learning tasks like machine translation, image captioning and speech recognition, but have yet to show competitive results for handwritten text recognition. To this end, we propose an attention-based sequence-to-sequence model. It combines a convolutional neural network as a generic feature extractor with a recurrent neural network to encode both the visual information, as well as the temporal context between characters in the input image, and uses a separate recurrent neural network to decode the actual character sequence. We make experimental comparisons between various attention mechanisms and positional encodings, in order to find an appropriate alignment between the input and output sequence. The model can be trained end-to-end and the optional integration of a hybrid loss allows the encoder to retain an interpretable and usable output, if desired. We achieve competitive results on the IAM and ICFHR2016 READ data sets compared to the state-of-the-art without the use of a language model, and we significantly improve over any recent sequence-to-sequence approaches.
- Preprint Article
- 10.31219/osf.io/9ds4a_v1
- Feb 26, 2025
The advent of the Transformer architecture has revolutionized Natural Language Processing (NLP), rendering traditional recurrent neural networks (RNNs) obsolete through innovations like self-attention and parallelization. This review paper serves a dual purpose: (1) to provide a theoretical deep dive into the Transformer's architecture, dissecting its core components (self-attention mechanisms, positional encoding, and encoder-decoder frameworks), and (2) to deliver practical, educational value through a modular, open-source PyTorch implementation. By translating theory into executable code, we demystify the "black box" of Transformers, enabling researchers, educators, and developers to experiment with state-of-the-art NLP tools even under hardware constraints. Our implementation is tested on an English-Hindi translation task. Beyond technical analysis, this paper emphasizes the social impact of democratizing AI education, showing how accessible frameworks can empower global communities to address challenges in language preservation, healthcare, and education, bridging the gap between theoretical understanding and hands-on application.
- Conference Article
- 10.5281/zenodo.3362981
- Sep 20, 2019
- Zenodo (CERN European Organization for Nuclear Research)