Contextualized Word Embeddings Research Articles

Increased interest in the use of word embeddings as word representations for biomedical named entity recognition (BioNER) has highlighted the need for evaluations that aid in selecting the best word embedding to use. One common criterion for selecting a word embedding is the type of source from which it is generated: general (e.g., Wikipedia, Common Crawl) or specific (e.g., biomedical literature). Using specific word embeddings for the BioNER task has been strongly recommended, because they have provided better coverage and semantic relationships among medical entities. To the best of our knowledge, most studies have focused on improving BioNER task performance by, on the one hand, combining several features extracted from the text (for instance, linguistic, morphological, character embedding, and word embedding features) and, on the other, testing several state-of-the-art named entity recognition algorithms. These studies, however, pay little attention to the influence of the word embeddings and make it difficult to observe their real impact on the BioNER task. For this reason, the present study evaluates three well-known NER algorithms (CRF, BiLSTM, BiLSTM-CRF) on two corpora (DrugBank and MedLine) using two classic word embeddings, GloVe Common Crawl (general) and Pyysalo PM + PMC (specific), as the only features. Furthermore, three contextualized word embeddings (ELMo, Pooled Flair, and Transformer) are compared in their general and specific versions. The aim is to determine whether general embeddings can perform better than specialized ones on the BioNER task. To this end, four experiments were designed. The first set out to identify the combination of classic word embedding, NER algorithm, and corpus that results in the best performance. The second evaluated the effect of corpus size on performance. The third assessed the semantic cohesiveness of the classic word embeddings and their correlation with several gold standards. The fourth evaluated the performance of general and specific contextualized word embeddings on the BioNER task. Results show that the classic general word embedding GloVe Common Crawl performed better on the DrugBank corpus, despite having less word coverage and weaker internal semantic relationships than the classic specific word embedding, Pyysalo PM + PMC, whereas among the contextualized word embeddings the specific ones gave the best results. We conclude, therefore, that when classic word embeddings are used as features on the BioNER task, the general ones can be considered a good option. When contextualized word embeddings are used, on the other hand, the specific ones are the best option.
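
The notion of word coverage used in this abstract can be illustrated with a minimal sketch. The abstract does not say how coverage was measured; the version below assumes it is the fraction of distinct corpus tokens that have a vector in the embedding vocabulary. The function name and the toy vocabularies are hypothetical and are not taken from the study.

```python
def vocabulary_coverage(corpus_tokens, embedding_vocab):
    """Fraction of distinct corpus tokens found in the embedding vocabulary.

    Assumption: coverage is computed over word types, not token occurrences.
    """
    types = set(corpus_tokens)
    if not types:
        return 0.0
    return len(types & set(embedding_vocab)) / len(types)


# Toy data, invented for illustration only.
corpus = ["aspirin", "inhibits", "cyclooxygenase", "aspirin", "warfarin"]
general_vocab = {"aspirin", "inhibits", "the"}
specific_vocab = {"aspirin", "inhibits", "cyclooxygenase", "warfarin"}

general_cov = vocabulary_coverage(corpus, general_vocab)
specific_cov = vocabulary_coverage(corpus, specific_vocab)
print(f"general: {general_cov:.2f}, specific: {specific_cov:.2f}")
```

A higher coverage value only says that more corpus words have a vector; as the abstract reports, it does not by itself guarantee better NER performance.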

Context. In the current information era, the problem of analyzing large volumes of unlabeled textual data and grouping it by the semantic similarity between texts is emerging. This raises the need for robust text analysis algorithms, namely, clustering and extraction of key data from texts. Despite recent progress in the field of natural language processing, new neural methods lack interpretability when used for unsupervised tasks, whereas traditional distributed semantics and word counting techniques tend to disregard contextual information. Objective. The objective of the study is to develop interpretable text clustering and cluster labeling methods, based on semantic similarity, that require no additional training on the user's dataset. Method. To approach the task of text clustering, we incorporate deep contextualized word embeddings and analyze their evolution through the layers of pretrained transformer models. Given word embeddings, we look for similar tokens across the entire corpus and form topics that are present in multiple sentences. We merge topics so that sentences that share many topics are assigned to one cluster. Because one sentence can contain several topics, it can be present in more than one cluster simultaneously. Similarly, to generate labels for an existing cluster, we use token embeddings to order tokens by how descriptive they are of the cluster. To do so, we propose a novel metric, the token rank measure, and evaluate two other metrics. Results. A new unsupervised text clustering approach was described and implemented. It is capable of assigning a text to different clusters based on semantic similarity to other texts in the group. A keyword extraction approach was developed and applied in both text clustering and cluster labeling tasks. The obtained clusters are annotated and can be interpreted through the terms that formed them. Conclusions. Evaluation on different datasets demonstrated the applicability, relevance, and interpretability of the obtained results. The advantages of the proposed methods and possible improvements to them were described. Recommendations for using the methods were provided, as well as possible modifications.
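
The clustering steps in the Method part of this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not name the similarity measure, the thresholds, or the merging criterion, so cosine similarity, the two threshold values, and Jaccard overlap are all assumed here, and toy 2-D vectors stand in for transformer token embeddings. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_topics(tokens, sim_threshold=0.9):
    """Link similar tokens from different sentences (union-find) and
    return the sentence set of every group spanning 2+ sentences."""
    parent = list(range(len(tokens)))

    def root(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(tokens)), 2):
        (si, _, vi), (sj, _, vj) = tokens[i], tokens[j]
        if si != sj and cosine(vi, vj) >= sim_threshold:
            parent[root(i)] = root(j)

    groups = {}
    for i, (sent_id, _, _) in enumerate(tokens):
        groups.setdefault(root(i), set()).add(sent_id)
    return [sents for sents in groups.values() if len(sents) > 1]


def merge_topics(topics, overlap_threshold=0.5):
    """Merge topics whose sentence sets overlap strongly (assumed: Jaccard)."""
    clusters = [set(t) for t in topics]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(clusters)), 2):
            inter = len(clusters[a] & clusters[b])
            union = len(clusters[a] | clusters[b])
            if inter / union >= overlap_threshold:
                clusters[a] |= clusters[b]
                del clusters[b]
                merged = True
                break
    return clusters


# Toy input: (sentence id, token, stand-in embedding).
tokens = [
    (0, "bank", (1.0, 0.0)),
    (0, "river", (0.0, 1.0)),
    (1, "loan", (0.95, 0.05)),
    (2, "stream", (0.05, 0.95)),
]
topics = find_topics(tokens)
clusters = merge_topics(topics)
print(clusters)
```

In this toy input, sentence 0 carries two topics that do not overlap enough to merge, so it ends up in two clusters, which mirrors the multi-cluster assignment the abstract describes.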

Articles published on Contextualized Word Embeddings

MenuNER: Domain-Adapted BERT Based NER Approach for a Domain with Limited Dataset and Its Application to Food Menu Domain

Circles are like Ellipses, or Ellipses are like Circles? Measuring the Degree of Asymmetry of Static and Contextual Word Embeddings and the Implications to Representation Learning

A Deep Learning Sentiment Analyser for Social Media Comments in Low-Resource Languages

An empirical evaluation of text representation schemes to filter the social media stream

Stance detection with BERT embeddings for credibility analysis of information on social media.

Biomedical event trigger extraction based on multi-layer residual BiLSTM and contextualized word representations

Exploiting Contextual Word Embedding of Authorship and Title of Articles for Discovering Citation Intent Classification

Hybrid Deep Learning for Medication-Related Information Extraction From Clinical Texts in French: MedExt Algorithm Development Study.

Prepositional Polysemy through the lens of contextualized word embeddings

Comparing general and specialized word embeddings for biomedical named entity recognition.

Technical Note: An embedding-based medical note de-identification approach with sparse annotation.

Neural Networks with Emotion Associations, Topic Modeling and Supervised Term Weighting for Sentiment Analysis.

FAD-BERT: Improved prediction of FAD binding sites using pre-training of deep bidirectional transformers

GT-Finder: Classify the family of glucose transporters with pre-trained BERT language models

A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information.

Early Detection of Severe Flu Outbreaks using Contextual Word Embeddings

Learning Cross-Lingual Mappings in Imperfectly Isomorphic Embedding Spaces

Multitopic Text Clustering and Cluster Labeling Using Contextualized Word Embeddings

A Neural Generative Model for Joint Learning Topics and Topic-Specific Word Embeddings

PERL: Pivot-based Domain Adaptation for Pre-trained Deep Contextualized Embedding Models
