Learner Corpora without Error Tagging

Stefano Rastelli

doi:10.13092/lo.38.507

Abstract

The article explores the possibility of adopting a form-to-function perspective when annotating learner corpora in order to gain deeper insight into systematic features of interlanguage. A split between forms and functions (or categories) is desirable in order to avoid the "comparative fallacy" and because, especially in basic varieties, forms may precede functions (e.g., what resembles a "noun" might have a different function, or a function may show up in unexpected forms). In the computer-aided error analysis tradition, all items produced by learners are traced to a grid of error tags based on the categories of the target language. By contrast, we believe it is possible to record and make retrievable both words and sequences of characters independently of their functional-grammatical label in the target language. For this purpose, at the University of Pavia we adapted a probabilistic POS tagger designed for L1 data to L2 data. Despite the criticism that this operation may attract, we found that it is better to work with "virtual categories" than with errors. The article outlines the theoretical background of the project and presents some examples intended to make the potential of SLA-oriented (non-error-based) tagging clearer.

Highlights

The article explores the possibility of adopting a form-to-function perspective when annotating learner corpora in order to gain deeper insight into systematic features of interlanguage.
It is believed that L1 taggers are useless because they are unable to capture the divergent phenomena occurring in learner corpora (LC).
The first solution, which is widely accepted in the literature and adopted in many European projects, is that learner data is best viewed in terms of errors.

Summary

POS annotation and error tagging

The topic of this article is the Part-of-Speech (POS) annotation of learner corpora (LC). The research question is whether it is feasible and convenient to instruct an automatic tagger capable of recognizing and annotating the grammatical categories in learner data. Misspelled, badly uttered, incomprehensible and uninterpretable items are bound to escape the formal requirements of automatic analyzers and robust parsers. To face this issue, two different solutions are at hand. The POS error-tagging procedure is made up of three steps: (a) collecting learners' typical mistakes in a single list (typical mistakes/errors with respect to homogeneous groups of learners); (b) turning this list into errors related to traditional linguistic categories (such as errors in nouns, adjectives, verbs, etc.); (c) tagging the items in the list using a markup language (for instance, XML). After an LC has been error-tagged in this way, all occurrences are retrievable with software (for instance, WordSmith Tools or Xaira).
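Steps (b) and (c) and the retrieval that follows can be illustrated with a minimal sketch. The article does not specify a tagset, so the `<err>` element, its `cat` and `target` attributes, and the toy learner sentences below are assumptions invented for this example; the retrieval uses Python's standard XML parser in place of WordSmith Tools or Xaira.

```python
import xml.etree.ElementTree as ET

# Invented toy data: the <err> element and its "cat"/"target" attributes
# are a hypothetical error tagset, not one taken from the article.
SAMPLE = """<text learner="L2-IT-001">
  <s>Ieri io <err cat="verb" target="sono andato">andare</err> al <err cat="noun" target="mercato">mercata</err>.</s>
  <s>La casa è <err cat="adjective" target="bella">bello</err>.</s>
</text>"""


def find_errors(xml_text, category=None):
    """Return (learner form, error category, target form) for each tagged error.

    If `category` is given, only errors of that category are returned.
    """
    root = ET.fromstring(xml_text)
    hits = []
    for err in root.iter("err"):
        if category is None or err.get("cat") == category:
            hits.append((err.text, err.get("cat"), err.get("target")))
    return hits


if __name__ == "__main__":
    # List every tagged error as tab-separated columns.
    for form, cat, target in find_errors(SAMPLE):
        print(f"{form}\t{cat}\t{target}")
```

Note that every `cat` value in such a scheme is a category of the target language, which is the property of error tagging that the article questions.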

Surface phenomena and acquisitional facts

SLA tagging and the rules of interlanguage

The risk of comparative fallacy

Running an L1 tagger on L2 data

L2 researchers can take advantage of how the TreeTagger works

The unexpected data

Future research


Journal: Linguistik Online	Publication Date: Apr 1, 2009
Citations: 9	License type: CC BY 3.0
