Abstract
We describe the UTFPR systems submitted to the Lexical Complexity Prediction shared task of SemEval 2021. They perform complexity prediction by combining classic features, such as word frequency, n-gram frequency, word length, and number of senses, with BERT vectors. We test numerous feature combinations and machine learning models in our experiments and find that BERT vectors, even if not optimized for the task at hand, are a great complement to classic features. We also find that employing the principle of compositionality can potentially help in phrase complexity prediction. Our systems place 45th out of 55 for single words and 29th out of 38 for phrases.
Highlights
Measuring the complexity of words can be useful in many ways. It facilitates the creation of text simplification technologies that could, for example, help in identifying and adapting challenging excerpts of literary pieces targeting specific groups, such as children (De Belder and Moens, 2010) and second language learners (Paetzold and Specia, 2016e), and make news articles and official documents more accessible to the general population (Paetzold and Specia, 2016a). This task has received a considerable amount of attention in the past few years, especially due to the popularity of the Complex Word Identification (CWI) shared tasks of 2016 (Paetzold and Specia, 2016c) and 2018 (Yimam et al., 2018), where dozens of teams were challenged to judge the complexity of words in context.
While it has been observed that word frequencies tend to drive the performance of effective complexity prediction systems (Paetzold and Specia, 2016c), we hypothesize that the wealth of knowledge present in transformer-based models such as BERT can help in extracting complementary contextual complexity clues.
Frequencies were calculated using a 5-gram language model trained over family movies from SubIMDB.
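As a toy illustration of the count-based side of this setup, n-gram frequencies can be gathered with a simple sliding window (a minimal stdlib sketch; the corpus, tokenization, and smoothing of an actual 5-gram language model over SubIMDB are not reproduced here):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams in a token sequence with a sliding window."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus standing in for the SubIMDB family-movie subtitles.
tokens = "the cat sat on the mat and the cat ran".split()
counts = ngram_counts(tokens, 2)
print(counts[("the", "cat")])  # the bigram "the cat" occurs twice
```

In a real system these raw counts would be normalized into probabilities by the language model rather than used directly.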
Summary
The majority of the most successful systems submitted to the CWI shared tasks combined ensemble methods, such as Random Forests (Ho, 1995) and AdaBoost (Freund and Schapire, 1997), with numerous linguistic features, including word frequencies, n-gram frequencies, word length, number of senses, number of syllables, psycholinguistic metrics, and word embeddings (Konkol, 2016; Malmasi et al., 2016; Paetzold and Specia, 2016d; Gooding and Kochmar, 2018; Hartmann and Dos Santos, 2018). Because these tasks were held prior to the ascension of transformer-based masked language models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), we could not find any systems that exploited the power of the features produced by them. We present the task being addressed (Section 2), our approach (Section 3), some preliminary experiments (Section 4), our final shared task results (Section 5), and our conclusions (Section 6).
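A minimal sketch of how such feature vectors might be assembled (the frequency table, sense counts, and four-dimensional stand-in for a 768-dimensional BERT embedding are all illustrative assumptions, not the actual resources used; a real system would feed the resulting vector to an ensemble regressor such as a Random Forest):

```python
import math

# Illustrative frequency and sense tables (real systems derive these
# from corpora such as SubIMDB and from lexicons such as WordNet).
WORD_FREQ = {"the": 1_000_000, "cat": 5_000, "perplexing": 12}
SENSE_COUNT = {"cat": 8, "perplexing": 1}

def classic_features(word):
    """Classic complexity features: log frequency, word length, sense count."""
    freq = WORD_FREQ.get(word.lower(), 1)  # fall back to 1 for unseen words
    senses = SENSE_COUNT.get(word.lower(), 0)
    return [math.log(freq), float(len(word)), float(senses)]

def combined_features(word, bert_vector):
    """Concatenate classic features with a precomputed contextual BERT vector."""
    return classic_features(word) + list(bert_vector)

# Toy 4-dimensional stand-in for a contextual embedding of the target word.
features = combined_features("perplexing", [0.1, -0.2, 0.3, 0.0])
print(len(features))  # 7: three classic features plus the embedding
```

The concatenated vector is what an ensemble learner would consume; the point of the sketch is only the feature layout, not the downstream model.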