Abstract

Obtaining meaning-rich representations of social media inputs, such as Tweets (unstructured and noisy text), from general-purpose pre-trained language models has become challenging, as these inputs typically deviate from mainstream English usage. The proposed research establishes effective methods for improving the comprehension of noisy texts. To this end, we propose a new generic methodology for deriving a diverse set of sentence vectors by combining and extracting various linguistic characteristics from the latent representations of multi-layer, pre-trained language models. Further, we establish how BERT, a state-of-the-art pre-trained language model, comprehends the linguistic attributes of Tweets, in order to identify appropriate sentence representations. Five new probing tasks are developed for Tweets, which can serve as benchmark probing tasks for studying noisy text comprehension. Experiments evaluate classification accuracy for sentence vectors derived from GloVe-based pre-trained models, from Sentence-BERT, and from different hidden layers of the BERT model. We show that the initial and middle layers of BERT capture the key linguistic characteristics of noisy texts better than its later layers. With complex predictive models, we further show that sentence vector length matters less for capturing linguistic information, and that the proposed sentence vectors for noisy texts outperform existing state-of-the-art sentence vectors.
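
As a concrete illustration of deriving layer-wise sentence vectors, the sketch below mean-pools the token representations of individual BERT hidden layers into fixed-length sentence vectors. This is a minimal sketch, assuming the Hugging Face transformers package and the public bert-base-uncased checkpoint rather than the authors' exact pipeline; the example Tweet and the chosen layers are illustrative only.

```python
# Minimal sketch: one sentence vector per BERT hidden layer via mean pooling.
# Assumes the Hugging Face `transformers` package and `bert-base-uncased`.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def layer_sentence_vector(text: str, layer: int) -> torch.Tensor:
    """Mean-pool the token embeddings of one hidden layer into a sentence vector."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    hidden = out.hidden_states[layer]           # (1, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)  # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Compare an early, a middle, and a late layer for a noisy, Tweet-like input.
tweet = "omg this new phone is liiit #blessed"
for layer in (1, 6, 12):
    vec = layer_sentence_vector(tweet, layer)
    print(layer, vec.shape)  # each layer yields a 768-dimensional sentence vector
```

Pooling every hidden layer in this way yields a family of candidate sentence vectors whose probing-task accuracy can then be compared layer by layer.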

Highlights

  • Natural Language Processing (NLP) and its subfield, Natural Language Understanding (NLU), primarily focus on the well-known complex problem of machine reading comprehension

  • We analyze the distribution of language understanding across the various regions of the Bidirectional Encoder Representations from Transformers (BERT) model used in this study

  • The research work reported in this paper demonstrates that the general language understanding of pre-trained language models, such as BERT, can be effectively exploited to comprehend noisy texts

Introduction

Natural Language Processing (NLP) and its subfield, Natural Language Understanding (NLU), primarily focus on the well-known complex problem of machine reading comprehension. While a plethora of techniques have already been proposed, representing sentences as vectors of real numbers in a high-dimensional continuous space is still attracting attention [1,2]. Following the rapid rise of Word2Vec [3], both word and sentence embeddings have shaped how text is represented as vectors. Word embeddings [20] have become the de facto starting point for representing the meaning of words. Static methods, such as Word2Vec [3], GloVe [5], and FastText [21], assign a single fixed representation to each word in the vocabulary. These techniques therefore cannot capture the contextual meaning of a word.
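
To make this limitation concrete, the following sketch shows that a static lookup returns the same vector for a word regardless of context, and how such vectors are typically averaged into a simple sentence representation. It assumes the gensim package and its downloadable glove-twitter-25 vectors purely for illustration; any static Word2Vec, GloVe, or FastText model behaves the same way.

```python
# Sketch of the limitation of static word embeddings: one fixed vector per word.
# Assumes `gensim` and its downloadable "glove-twitter-25" vectors (illustrative choice).
import numpy as np
import gensim.downloader as api

glove = api.load("glove-twitter-25")  # word -> fixed 25-dimensional vector

# The same surface form always maps to the same vector, regardless of context.
v1 = glove["bank"]  # as in "I sat on the river bank"
v2 = glove["bank"]  # as in "I deposited cash at the bank"
print(np.allclose(v1, v2))  # True: a static embedding cannot separate the senses

# A simple sentence vector from static embeddings: average the word vectors.
def avg_sentence_vector(tokens):
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(glove.vector_size)

print(avg_sentence_vector("omg this phone is lit".split()).shape)  # (25,)
```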
