General Corpus Research Articles

Background: There has been increasing interest in machine learning-based natural language processing (NLP) methods in radiology; however, models have often used word embeddings trained on general web corpora because no radiology-specific corpus was available.

Purpose: We examined the potential of Radiopaedia to serve as a general radiology corpus for producing radiology-specific word embeddings that could enhance performance on an NLP task on radiological text.

Materials and methods: Embeddings of dimension 50, 100, 200, and 300 were trained on articles collected from Radiopaedia using a GloVe algorithm and evaluated on analogy completion. A shallow neural network using input from either our trained embeddings or pre-trained Wikipedia 2014 + Gigaword 5 (WG) embeddings was used to label the Radiopaedia articles. Labeling performance was evaluated based on exact match accuracy and Hamming loss. McNemar's test with continuity correction and the Benjamini-Hochberg correction, and a 5×2 cross-validation paired two-tailed t-test, were used to assess statistical significance.

Results: For accuracy in the analogy task, 50-dimensional (50-D) Radiopaedia embeddings outperformed WG embeddings on tumor origin analogies (p < 0.05) and organ adjectives (p < 0.01), whereas WG embeddings tended to outperform on inflammation location and bone vs. muscle analogies (p < 0.01). The two embeddings had comparable performance on other subcategories. In the labeling task, the Radiopaedia-based model outperformed the WG-based model at 50, 100, 200, and 300-D for exact match accuracy (p < 0.001, p < 0.001, p < 0.01, and p < 0.05, respectively) and Hamming loss (p < 0.001, p < 0.001, p < 0.01, and p < 0.05, respectively).

Conclusion: We have developed a set of word embeddings from Radiopaedia and shown that they can preserve relevant medical semantics and augment performance on a radiology NLP task. Our results suggest that the cultivation of a radiology-specific corpus can benefit radiology NLP models in the future.
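The two labeling metrics named in this abstract, exact match accuracy and Hamming loss, can be illustrated with a small sketch. This is not the authors' evaluation code: the label names, the sample data, and the function names below are hypothetical and chosen only to show how the two metrics differ on multi-label predictions.

```python
from typing import List, Set

def exact_match_accuracy(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Fraction of samples whose predicted label set equals the true set exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def hamming_loss(y_true: List[Set[str]], y_pred: List[Set[str]], n_labels: int) -> float:
    """Fraction of individual (sample, label) decisions that are wrong.

    The symmetric difference t ^ p holds the labels that are present in
    exactly one of the two sets, i.e. the wrong decisions for that sample.
    """
    wrong = sum(len(t ^ p) for t, p in zip(y_true, y_pred))
    return wrong / (len(y_true) * n_labels)

# Hypothetical label vocabulary and data (assumptions, not from the paper).
LABELS = ["chest", "musculoskeletal", "neuro", "oncology"]
y_true = [{"chest"}, {"neuro", "oncology"}, {"musculoskeletal"}, {"chest", "oncology"}]
y_pred = [{"chest"}, {"neuro"}, {"musculoskeletal"}, {"chest", "neuro"}]

acc = exact_match_accuracy(y_true, y_pred)
loss = hamming_loss(y_true, y_pred, len(LABELS))
print(f"exact match accuracy = {acc}, Hamming loss = {loss}")
```

Exact match accuracy gives no credit for a partially correct label set, whereas Hamming loss counts each label decision separately, which is why the two are often reported together.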

Sentiment polarity classification in social media is an important task because it makes it possible to gather trends on particular subjects from a set of opinions. Great advances have recently been made using deep learning techniques such as word embeddings, recurrent neural networks, and encoders such as BERT. Unfortunately, these techniques require large amounts of data, which in some cases are not available. To model this situation, challenges such as the Spanish TASS, organized by the Spanish Society for Natural Language Processing (SEPLN), have been proposed. They pose particular difficulties. First, the training and test sets are unevenly sized: the test set is more than eight times the size of the training set. Second, the distribution of classes is markedly unbalanced and also differs between the two sets. Finally, there are four different labels, so current classification methods must be adapted for multiclass handling. Traditional machine learning methods such as Naïve Bayes, Logistic Regression, and Support Vector Machines achieve modest performance under these conditions, but used as an ensemble they can attain competitive performance. Several strategies for building classifier ensembles have been proposed; this paper proposes estimating an optimal weighting scheme using a Differential Evolution algorithm focused on the particular issues that multiclass classification and unbalanced corpora pose. The ensemble with the proposed optimized weighting scheme improves the classification results on the full test set of the TASS challenge (General corpus), achieving state-of-the-art performance when compared with other works on this task, which make no use of NLP techniques.
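To make the idea of a weighted classifier ensemble concrete, here is a minimal sketch of weighted soft voting over four classes. It is an illustration under assumptions, not the paper's method: the class names, the probability vectors, and the hand-picked weights are all hypothetical, and the Differential Evolution search that the paper uses to estimate the weights is not shown.

```python
from typing import List, Sequence

# Hypothetical names for the four polarity labels (assumption).
CLASSES = ["positive", "negative", "neutral", "none"]

def weighted_vote(probas: Sequence[Sequence[float]], weights: Sequence[float]) -> int:
    """Combine per-classifier class probabilities for one sample.

    probas[i][c] is classifier i's probability for class c.
    Returns the index of the class with the highest weighted average.
    """
    n_classes = len(probas[0])
    total = sum(weights)
    combined: List[float] = [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=combined.__getitem__)

# Hypothetical outputs of three base classifiers for a single message.
probas = [
    [0.6, 0.2, 0.1, 0.1],  # e.g. Naive Bayes
    [0.3, 0.5, 0.1, 0.1],  # e.g. Logistic Regression
    [0.2, 0.6, 0.1, 0.1],  # e.g. Support Vector Machine
]

equal = weighted_vote(probas, [1.0, 1.0, 1.0])   # every classifier counts the same
skewed = weighted_vote(probas, [3.0, 1.0, 1.0])  # first classifier trusted more
print(CLASSES[equal], CLASSES[skewed])
```

The weights change which class wins, so an optimizer can search the weight vector for the setting that scores best on held-out data. With unbalanced classes, the score being optimized would presumably need to account for the minority classes; the specific objective is not stated in the abstract.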

Articles published on General Corpus

The influence of the benchmark corpus on keyword analysis

DeIDNER Corpus: Annotation of Clinical Discharge Summary Notes for Named Entity Recognition Using BRAT Tool.

Experiência vivenciada por pessoas acometidas por COVID-19 no percurso da internação à alta hospitalar [Experience of people affected by COVID-19 from hospital admission to discharge]

Collocates of 'great' and 'good' in the Corpus of Contemporary American English and Indonesian EFL textbooks

A pre-training and self-training approach for biomedical named entity recognition.

SpaceTransformers: Language Modeling for Space Systems

Sentiment Analysis Using Multi-Head Attention Capsules With Multi-Channel CNN and Bidirectional GRU

Синтетические и аналитические аспектуальные формы монгольского глагола в количественном освещении [Synthetic and analytic aspectual forms of the Mongolian verb in a quantitative light]

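Структурно-вероятностная модель монгольской грамматики и измерение употребительности обобщенных грамматических единиц [A structural-probabilistic model of Mongolian grammar and measurement of the usage frequency of generalized grammatical units]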

The scope of criminal liability for misappropriation of authorship in EU countries: comparative analysis

Doing Corpus Linguistics: Toward a Conceptual Framework for Indicator of Gender in English Language and Education

Domain specific word embeddings for natural language processing in radiology

“Maleducados/Ill-mannered” during the #A28 political campaign on Twitter

Schadenfreude: Malicious Joy in Social Media Interactions.

Evolutionary Optimization of Ensemble Learning to Determine Sentiment Polarity in an Unbalanced Multiclass Corpus.

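Caracterização das teses e dissertações brasileiras na área de Enfermagem em Unidade de Terapia Intensiva [Characterization of Brazilian theses and dissertations in the area of Intensive Care Unit nursing]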

ChoCo: a multimodal corpus of the Choctaw language

Sentiment Analysis using Multiple Word Embedding for Words

Using word embeddings for library and information science research

An acoustic phonetic description of Nungon vowels.
