Abstract

Obtaining meaning-rich representations of social media inputs, such as Tweets (unstructured and noisy text), from general-purpose pre-trained language models has become challenging, as these inputs typically deviate from mainstream English usage. The proposed research establishes effective methods for improving the comprehension of noisy texts. To this end, we propose a new generic methodology for deriving a diverse set of sentence vectors by combining and extracting various linguistic characteristics from the latent representations of multi-layer, pre-trained language models. Further, we establish how BERT, a state-of-the-art pre-trained language model, comprehends the linguistic attributes of Tweets, in order to identify appropriate sentence representations. Five new probing tasks are developed for Tweets, which can serve as benchmark probing tasks for studying noisy text comprehension. Experiments evaluate classification accuracy for sentence vectors derived from GloVe-based pre-trained models, from Sentence-BERT, and from different hidden layers of the BERT model. We show that the initial and middle layers of BERT capture the key linguistic characteristics of noisy texts better than its later layers. With complex predictive models, we further show that sentence vector length matters less for capturing linguistic information, and that the proposed sentence vectors for noisy texts outperform existing state-of-the-art sentence vectors.
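
As a concrete illustration of deriving layer-wise sentence vectors, the sketch below mean-pools the token representations of individual BERT hidden layers into fixed-length sentence vectors. This is a minimal sketch, assuming the Hugging Face transformers package and the public bert-base-uncased checkpoint rather than the authors' exact pipeline; the example Tweet and the chosen layers are illustrative only.

```python
# Minimal sketch: one sentence vector per BERT hidden layer via mean pooling.
# Assumes the Hugging Face `transformers` package and `bert-base-uncased`.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def layer_sentence_vector(text: str, layer: int) -> torch.Tensor:
    """Mean-pool the token embeddings of one hidden layer into a sentence vector."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    hidden = out.hidden_states[layer]           # (1, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)  # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Compare an early, a middle, and a late layer for a noisy, Tweet-like input.
tweet = "omg this new phone is liiit #blessed"
for layer in (1, 6, 12):
    vec = layer_sentence_vector(tweet, layer)
    print(layer, vec.shape)  # each layer yields a 768-dimensional sentence vector
```

Pooling every hidden layer in this way yields a family of candidate sentence vectors whose probing-task accuracy can then be compared layer by layer.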

Highlights

  • Natural Language Processing (NLP) and its subfield, Natural Language Understanding (NLU), primarily focus on the well-known complex problem of machine reading comprehension

  • We analyze the distribution of language understanding across the various regions of the Bidirectional Encoder Representations from Transformers (BERT) model used in this study

  • The research work reported in this paper demonstrates that the general language understanding of pre-trained language models, such as BERT, can be effectively exploited to comprehend noisy texts

Introduction

Natural Language Processing (NLP) and its subfield, Natural Language Understanding (NLU), primarily focus on the well-known complex problem of machine reading comprehension. While a plethora of techniques have already been proposed, representing sentences as vectors of real numbers in a high-dimensional continuous space is still attracting attention [1,2]. Following the rapid rise of Word2Vec [3], both word and sentence embeddings have shaped how text is represented as vectors. Word embeddings [20] have become the de facto starting point for representing the meaning of words. Static methods, such as Word2Vec [3], GloVe [5], and FastText [21], assign a single fixed representation to each word in the vocabulary. These techniques therefore cannot capture the contextual meaning of a word.
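
To make this limitation concrete, the following sketch shows that a static lookup returns the same vector for a word regardless of context, and how such vectors are typically averaged into a simple sentence representation. It assumes the gensim package and its downloadable glove-twitter-25 vectors purely for illustration; any static Word2Vec, GloVe, or FastText model behaves the same way.

```python
# Sketch of the limitation of static word embeddings: one fixed vector per word.
# Assumes `gensim` and its downloadable "glove-twitter-25" vectors (illustrative choice).
import numpy as np
import gensim.downloader as api

glove = api.load("glove-twitter-25")  # word -> fixed 25-dimensional vector

# The same surface form always maps to the same vector, regardless of context.
v1 = glove["bank"]  # as in "I sat on the river bank"
v2 = glove["bank"]  # as in "I deposited cash at the bank"
print(np.allclose(v1, v2))  # True: a static embedding cannot separate the senses

# A simple sentence vector from static embeddings: average the word vectors.
def avg_sentence_vector(tokens):
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(glove.vector_size)

print(avg_sentence_vector("omg this phone is lit".split()).shape)  # (25,)
```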
