Abstract

Spelling variation in non-standard language, e.g. computer-mediated communication and historical texts, is usually treated as a deviation from a standard spelling, e.g. 2mr as a non-standard spelling of tomorrow. Consequently, in normalization, the standard approach to dealing with spelling variation, so-called non-standard words are mapped to their corresponding standard words. However, a corresponding standard word does not always exist. This can be the case for single types (like emoticons in computer-mediated communication) or for a complete language, e.g. texts from historical languages that did not develop into a standard variety. The approach presented in this thesis proposal deals with spelling variation in the absence of reference to a standard. The task is to detect pairs of types that are variants of the same morphological word. An approach for spelling-variant detection is presented in which pairs of potential spelling variants are generated with Levenshtein distance and subsequently filtered by supervised machine learning. The approach is evaluated on historical Low German texts. Finally, further perspectives are discussed.
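The candidate-generation step described in the abstract can be illustrated with a minimal Python sketch. It covers only the Levenshtein-based generation of candidate pairs, not the supervised filtering step. The function names `levenshtein` and `candidate_pairs`, the threshold parameter `max_dist` and its default of 1, and the example strings are assumptions made for illustration; none of them come from the paper.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def candidate_pairs(types, max_dist=1):
    """Return unordered pairs of distinct types within max_dist edits.

    max_dist is a hypothetical threshold; the source does not state one.
    """
    unique = sorted(set(types))
    return [(a, b) for a, b in combinations(unique, 2)
            if levenshtein(a, b) <= max_dist]


# Illustrative strings only; not taken from the evaluated corpus.
types = ["vnde", "unde", "dat", "dath", "wat"]
pairs = candidate_pairs(types, max_dist=1)
print(pairs)
```

A purely distance-based generator of this kind pairs any two strings that are close in spelling, including strings that may be different words, which is why the abstract describes a subsequent filtering step. The quadratic comparison over all type pairs is kept here for readability and would need a more efficient lookup on a large vocabulary.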

Highlights

  • Spelling variation is a well-known feature of non-standard language, e.g. computer-mediated communication (CMC) and historical texts (Baron et al., 2009; Eisenstein, 2013)

  • The predominant way to deal with spelling variation is normalization, i.e. non-standard words are mapped to a corresponding standard word, or a canonical form

  • Recall and F-score for the set of spelling-variant pairs extracted from the test set that did not appear in the training data

Summary

Introduction

Spelling variation is a well-known feature of non-standard language, e.g. computer-mediated communication (CMC) and historical texts (Baron et al., 2009; Eisenstein, 2013). We pursue an alternative to normalization: the task of spelling-variant detection. Instead of mapping non-standard or historical words to a standard form, the aim is to detect spelling variants in a set of types without reference to a canonical form. This task can therefore be applied in cases where no canonical form exists. An approach for detecting spelling variants is presented and evaluated on Middle Low German (GML), a group of German dialects from between 1200 and 1650. These dialects developed into Low German, a dialect group of German that has not undergone standardization.

Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 11–22, Valencia, Spain, April 3-7, 2017. © 2017 Association for Computational Linguistics

Related work
Defining spelling variation and spelling-variant detection
Approaches towards spelling-variant detection
Candidate-pair generation
Filtering overgenerated candidate pairs
Conclusion and future work
Improving spelling-variant detection
Extending the scope
Findings
Applications