Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction

Stefan Heid,Marcel Wever,Eyke Hüllermeier

doi:10.46298/jdmdh.6696

Abstract

Syntactic annotation of corpora in the form of part-of-speech (POS) tags is a key requirement for both linguistic research and downstream automated natural language processing (NLP) tasks. This problem is commonly tackled with machine learning methods, i.e., by training a POS tagger on a sufficiently large corpus of labeled data. While POS tagging can essentially be considered solved for modern languages, historical corpora turn out to be much more difficult, especially due to the lack of native speakers and the sparsity of training data. Moreover, most historical texts have neither sentences as we know them today nor a common orthography. These irregularities make automated POS tagging more difficult and error-prone. Under these circumstances, instead of being forced to predict and commit to a single tag, the POS tagger should be able to express its uncertainty. In this paper, we consider POS tagging within the framework of set-valued prediction, which allows the POS tagger to express its uncertainty by predicting a set of candidate POS tags instead of guessing a single one. The goal is to guarantee, with high confidence, that the correct POS tag is included while keeping the number of candidates small. In our experimental study, we find that extending state-of-the-art POS taggers to set-valued prediction yields more precise and robust taggings, especially for unknown words, i.e., words not occurring in the training data.
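
The trade-off described in the abstract, covering the correct tag with high confidence while keeping the candidate set small, can be illustrated with a minimal sketch. It assumes the tagger outputs a probability for each tag and greedily returns the smallest set of top-ranked tags whose total probability reaches a chosen confidence level. This is only one simple way to build a set-valued prediction and is not necessarily the procedure used in the paper; the function name `candidate_tag_set`, the `confidence` parameter, and the example probabilities are hypothetical.

```python
from typing import Dict, List


def candidate_tag_set(tag_probs: Dict[str, float], confidence: float = 0.9) -> List[str]:
    """Return the smallest set of top-ranked tags whose total probability
    reaches `confidence` (hypothetical helper, for illustration only)."""
    if not 0.0 < confidence <= 1.0:
        raise ValueError("confidence must lie in (0, 1]")

    # Rank tags from most to least probable.
    ranked = sorted(tag_probs.items(), key=lambda item: item[1], reverse=True)

    chosen: List[str] = []
    total = 0.0
    for tag, prob in ranked:
        chosen.append(tag)
        total += prob
        if total >= confidence:
            break  # Enough probability mass is covered; stop adding tags.
    return chosen


# Made-up probabilities for an ambiguous word: two tags are needed at 0.8.
ambiguous = {"NOUN": 0.55, "ADJ": 0.30, "VERB": 0.10, "ADV": 0.05}
print(candidate_tag_set(ambiguous, confidence=0.8))

# Made-up probabilities for a clear-cut word: a single tag suffices.
clear = {"DET": 0.97, "PRON": 0.03}
print(candidate_tag_set(clear, confidence=0.8))
```

In this sketch an uncertain prediction produces a larger candidate set and a confident one collapses to a single tag, which is the behavior the abstract motivates for unknown words.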

Journal: Journal of Data Mining & Digital Humanities
Publication Date: Oct 21, 2024
License type: cc-by

Similar Papers

A Ruled-Based Part of Speech (RPOS) Tagger for Malay Text Articles
Rayner Alfred ... Joe Henry Obit
01 Jan 2013

A Cascaded Unsupervised Model for PoS Tagging
Necva Bölücü ... Burcu Can
ACM Transactions on Asian and Low-Resource Language Information Processing | VOL. 20
31 Jan 2021

Hidden Markov Model based Part of Speech Tagging for Nepali language
Abhijit Paul ... Bipul Syam Purkayastha
01 Sep 2015

Combination of Genetic Algorithm and Brill Tagger Algorithm for Part of Speech Tagging Bahasa Madura
Nindian Puspa Dewi ... Ubaidi Ubaidi
Proceeding of the Electrical Engineering Computer Science and Informatics | VOL. 7
01 Oct 2020
