Abstract

While language identification works well on standard texts, it performs much worse on social media language, in particular dialectal language—even for English. First, to support work on English language identification, we contribute a new dataset of tweets annotated for English versus non-English, with attention to ambiguity, code-switching, and automatic generation issues. It is randomly sampled from all public messages, avoiding biases towards pre-existing language classifiers. Second, we find that a demographic language model—which identifies messages with language similar to that used by several U.S. ethnic populations on Twitter—can be used to improve English language identification performance when combined with a traditional supervised language identifier. It increases recall with almost no loss of precision, including, surprisingly, for English messages written by non-U.S. authors. Our dataset and identifier ensemble are available online.
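The abstract does not spell out how the two components are combined, so the following is only a minimal sketch of one plausible reading: accept a message as English if either the supervised identifier or the demographic model accepts it. An "either accepts" rule of this kind can only add accepted messages, which is consistent with the reported gain in recall. The function names, the toy word lists, and the 0.5 threshold are all hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable

# Hypothetical toy lexicons, for illustration only; real components would be
# a trained supervised language identifier and a demographic language model.
STANDARD_WORDS = {"the", "is", "are", "and", "this"}
INFORMAL_WORDS = {"finna", "tryna", "bruh", "aint", "yall", "lol"}


def toy_base_is_english(text: str) -> bool:
    """Stand-in for a supervised identifier: needs two standard function words."""
    tokens = text.lower().split()
    return sum(t in STANDARD_WORDS for t in tokens) >= 2


def toy_demographic_score(text: str) -> float:
    """Stand-in for a demographic model: fraction of tokens in the informal lexicon."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in INFORMAL_WORDS for t in tokens) / len(tokens)


def ensemble_is_english(
    text: str,
    base_is_english: Callable[[str], bool] = toy_base_is_english,
    demographic_score: Callable[[str], float] = toy_demographic_score,
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> bool:
    """Accept a message if the base identifier accepts it, or if the
    demographic score reaches the threshold (an OR-style ensemble)."""
    if base_is_english(text):
        return True
    return demographic_score(text) >= threshold
```

Under these assumptions, a message the base identifier misses can still be recovered by the second component, while a message that neither component accepts stays rejected. Raising the threshold trades some of that recovered recall for precision.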

Highlights

  • Language identification is the task of determining the major world language a document is written in

  • Compounding the challenge is domain mismatch: the types of casual language, dialectal language, and Internet-specific constructs found in social media are often not present in the standardized genres of training data for existing language identifiers

  • This may be especially problematic for language written by minority dialect speakers; for example, Blodgett et al. (2016) found that current language identification models had lower recall for tweets written in African-American English (AAE) than for those in standard English


Summary

Introduction

Language identification is the task of determining the major world language a document is written in. Compounding the challenge is domain mismatch: the types of casual language, dialectal language, and Internet-specific constructs found in social media are often not present in the standardized genres of training data for existing language identifiers. This may be especially problematic for language written by minority dialect speakers; for example, Blodgett et al. (2016) found that current language identification models had lower recall for tweets written in African-American English (AAE) than for those in standard English. In this work we maintain the paradigm of treating English as a broad language category, but propose that the texts

