Abstract
We describe the UTFPR systems submitted to the Lexical Complexity Prediction shared task of SemEval 2021. They perform complexity prediction by combining classic features, such as word frequency, n-gram frequency, word length, and number of senses, with BERT vectors. We test numerous feature combinations and machine learning models in our experiments and find that BERT vectors, even if not optimized for the task at hand, are a great complement to classic features. We also find that employing the principle of compositionality can potentially help in phrase complexity prediction. Our systems place 45th out of 55 for single words and 29th out of 38 for phrases.
Highlights
Measuring the complexity of words can be useful in many ways. It facilitates the creation of text simplification technologies that could, for example, help in identifying and adapting challenging excerpts of literary pieces targeting specific groups, such as children (De Belder and Moens, 2010) and second language learners (Paetzold and Specia, 2016e), and make news articles and official documents more accessible to the general population (Paetzold and Specia, 2016a). This task has received a considerable amount of attention in the past few years, especially due to the popularity of the Complex Word Identification (CWI) shared tasks of 2016 (Paetzold and Specia, 2016c) and 2018 (Yimam et al., 2018), where dozens of teams were challenged to judge the complexity of words in context.
While it has been observed that word frequencies tend to drive the performance of effective complexity prediction systems (Paetzold and Specia, 2016c), we hypothesize that the wealth of knowledge present in transformer-based models such as BERT can help in extracting complementary contextual complexity clues.
Frequencies were calculated using a 5-gram language model trained over family movies from SubIMDB.
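As a toy illustration of the count-based side of this setup, n-gram frequencies can be gathered with a simple sliding window (a minimal stdlib sketch; the corpus, tokenization, and smoothing of an actual 5-gram language model over SubIMDB are not reproduced here):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams in a token sequence with a sliding window."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus standing in for the SubIMDB family-movie subtitles.
tokens = "the cat sat on the mat and the cat ran".split()
counts = ngram_counts(tokens, 2)
print(counts[("the", "cat")])  # the bigram "the cat" occurs twice
```

In a real system these raw counts would be normalized into probabilities by the language model rather than used directly.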
Summary
The majority of the most successful systems submitted to the CWI shared tasks combined ensemble methods, such as Random Forests (Ho, 1995) and AdaBoost (Freund and Schapire, 1997), with numerous linguistic features, including word frequencies, n-gram frequencies, word length, number of senses, number of syllables, psycholinguistic metrics, and word embeddings (Konkol, 2016; Malmasi et al., 2016; Paetzold and Specia, 2016d; Gooding and Kochmar, 2018; Hartmann and Dos Santos, 2018). Because these tasks were held prior to the ascension of transformer-based masked language models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), we could not find any systems that exploited the power of the features produced by them. We present the task being addressed (Section 2), our approach (Section 3), some preliminary experiments (Section 4), our final shared task results (Section 5), and our conclusions (Section 6).
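A minimal sketch of how such feature vectors might be assembled (the frequency table, sense counts, and four-dimensional stand-in for a 768-dimensional BERT embedding are all illustrative assumptions, not the actual resources used; a real system would feed the resulting vector to an ensemble regressor such as a Random Forest):

```python
import math

# Illustrative frequency and sense tables (real systems derive these
# from corpora such as SubIMDB and from lexicons such as WordNet).
WORD_FREQ = {"the": 1_000_000, "cat": 5_000, "perplexing": 12}
SENSE_COUNT = {"cat": 8, "perplexing": 1}

def classic_features(word):
    """Classic complexity features: log frequency, word length, sense count."""
    freq = WORD_FREQ.get(word.lower(), 1)  # fall back to 1 for unseen words
    senses = SENSE_COUNT.get(word.lower(), 0)
    return [math.log(freq), float(len(word)), float(senses)]

def combined_features(word, bert_vector):
    """Concatenate classic features with a precomputed contextual BERT vector."""
    return classic_features(word) + list(bert_vector)

# Toy 4-dimensional stand-in for a contextual embedding of the target word.
features = combined_features("perplexing", [0.1, -0.2, 0.3, 0.0])
print(len(features))  # 7: three classic features plus the embedding
```

The concatenated vector is what an ensemble learner would consume; the point of the sketch is only the feature layout, not the downstream model.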