Abstract

This paper investigates the use of character n-gram frequencies for identifying complex words in English, German and Spanish texts. The approach is based on the assumption that complex words are likely to contain different character sequences than simple words. A multinomial Naive Bayes classifier was trained with n-grams of different lengths as features, and the best results were obtained with the combination of 2-grams and 4-grams. This variant was submitted to the Complex Word Identification Shared Task 2018 for all texts and achieved F-scores between 70% and 83%. The system ranked in the middle range for all English texts, third of fourteen submissions for German, and tenth of seventeen submissions for Spanish. The method is less well suited to the cross-lingual task, where it achieved only 59% on the French text.
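The approach described in the abstract can be illustrated with a minimal sketch: a multinomial Naive Bayes classifier over character 2-grams and 4-grams. This is not the authors' implementation. The function and class names (`char_ngrams`, `CharNgramNB`), the `_` boundary padding, the add-one smoothing, the skipping of unseen n-grams, and the ten-word toy training set are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, sizes=(2, 4)):
    """Return all character n-grams of the given sizes.

    The word is lowercased and padded with '_' on both sides
    (the padding is an assumption, not taken from the paper).
    """
    padded = f"_{word.lower()}_"
    return [padded[i:i + n] for n in sizes for i in range(len(padded) - n + 1)]


class CharNgramNB:
    """Multinomial Naive Bayes over character n-gram counts.

    Uses add-one (Laplace) smoothing; the smoothing choice is an assumption.
    """

    def fit(self, words, labels):
        self.class_counts = Counter(labels)      # words per class (for priors)
        self.ngram_counts = defaultdict(Counter)  # n-gram counts per class
        for word, label in zip(words, labels):
            self.ngram_counts[label].update(char_ngrams(word))
        self.vocab = {g for counts in self.ngram_counts.values() for g in counts}
        return self

    def predict(self, word):
        total_words = sum(self.class_counts.values())
        scores = {}
        for label, n_words in self.class_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_words / total_words)  # log prior
            for gram in char_ngrams(word):
                if gram in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log((counts[gram] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data, invented purely for illustration.
train_words = ["cat", "dog", "run", "big", "sun",
               "photosynthesis", "psychology", "hypothesis",
               "synthesis", "physiology"]
train_labels = ["simple"] * 5 + ["complex"] * 5

model = CharNgramNB().fit(train_words, train_labels)
print(model.predict("biosynthesis"))
```

On real data, the same idea would be trained on the annotated shared task corpora rather than on a handful of words, and the n-gram sizes would be chosen by comparing results across different combinations.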



Summary

Introduction

Complex Word Identification (CWI) refers to the identification of words that readers from a specific target audience consider complex. CWI is the first step towards lexical simplification, which aims to improve the readability of texts: a lexical simplification system should replace the identified complex words with simpler synonyms. Some of these systems have a CWI module at the beginning of their pipeline. The first shared task on CWI was organized at SemEval 2016 (Paetzold and Specia, 2016), where 21 teams submitted 42 systems trained to predict whether words in a given context were complex for a non-native English speaker. Although the relation between character n-grams and word complexity intuitively depends on the language, we still decided to investigate cross-lingual CWI and to participate in this track.

Related work
Character n-grams and multinomial Naive Bayes classifier
Results
Standard set-up
Concatenated English training corpus
Cross-lingual classification
Confusion analysis
Official shared task results
Summary and outlook
