Abstract

We present a large-scale Native Language Identification (NLI) experiment on new data, with a focus on cross-corpus evaluation to identify corpus- and genre-independent language transfer features. We test a new corpus and show it is comparable to other NLI corpora and suitable for this task. Cross-corpus evaluation on two large corpora achieves good accuracy and evidences the existence of reliable language transfer features, but lower performance also suggests that NLI models are not completely portable across corpora. Finally, we present a brief case study of features distinguishing Japanese learners’ English writing, demonstrating the presence of cross-corpus and cross-genre language transfer features that are highly applicable to SLA and ESL research.

Highlights

  • Native Language Identification, the task of determining the native language (L1) of an author based on a second language (L2) text, has received much attention recently

  • In this work we presented the first application of one of the largest and newest publicly available learner corpora to Native Language Identification (NLI)

  • Cross-validation experiments mirrored the performance of other corpora and demonstrated its utility for the task. We believe this will motivate future work by equipping researchers with a large-scale corpus that is highly suitable for NLI

Summary

Introduction

Native Language Identification, the task of determining the native language (L1) of an author based on a second language (L2) text, has received much attention recently. Some researchers have shifted their focus to developing data-driven methods for the automatic extraction and ranking of linguistic features that distinguish specific L1s (Swanson and Charniak, 2014). Such methods could be used to confirm existing SLA hypotheses and to create new ones. This hypothesis formulation is an inherently difficult problem requiring copious amounts of data. Contrary to this requirement, researchers have long noted the paucity of suitable corpora for this task (Brooke and Hirst, 2011). While it is the largest NLI dataset available, it only contains argumentative essays, limiting analyses to this genre.
