Abstract

In this paper, we study the problem of manually correcting automatic annotations of natural language in as efficient a manner as possible. We introduce a method for automatically segmenting a corpus into chunks such that many uncertain labels are grouped into the same chunk, while human supervision can be omitted altogether for other segments. A tradeoff must be found when choosing segment sizes. Short segments reduce the number of highly confident labels the annotator must supervise, which is useful because these labels are often already correct, and supervising correct labels is a waste of effort. Long segments, in contrast, reduce the cognitive effort caused by context switches. Our method helps find the segmentation that optimizes supervision efficiency by defining user models to predict the cost and utility of supervising each segment, and by solving a constrained optimization problem that balances these contradictory objectives. A user study demonstrates noticeable gains over pre-segmented, confidence-ordered baselines on two natural language processing tasks: speech transcription and word segmentation.
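To make the optimization concrete, here is a minimal sketch of how the segment-selection step could look, assuming the user models have already produced an integer cost prediction (e.g., seconds of annotator time) and a real-valued utility prediction for each candidate segment. Casting the selection as a 0/1 knapsack problem solved by dynamic programming is an illustrative simplification, not necessarily the paper's exact formulation.

```python
# Illustrative sketch: choose which segments the annotator should supervise,
# maximizing total predicted utility subject to a total predicted cost budget.
# Cost/utility values are assumed to come from user models as in the paper.

def select_segments(segments, budget):
    """segments: list of (segment_id, predicted_cost, predicted_utility),
    with integer costs. Returns (best_utility, chosen_segment_ids) under
    the constraint total_cost <= budget (0/1 knapsack via DP)."""
    dp = [(0.0, frozenset())] * (budget + 1)  # dp[c] = best (utility, ids) at cost <= c
    for seg_id, cost, utility in segments:
        for c in range(budget, cost - 1, -1):  # descending: each segment used once
            cand_utility = dp[c - cost][0] + utility
            if cand_utility > dp[c][0]:
                dp[c] = (cand_utility, dp[c - cost][1] | {seg_id})
    return dp[budget]

# Example: three candidate segments, 10 cost units of annotator time available.
best_utility, chosen = select_segments(
    [("s1", 4, 3.0), ("s2", 6, 5.5), ("s3", 5, 4.0)], budget=10)
print(best_utility, chosen)  # 8.5 frozenset({'s1', 's2'})
```

Since the highlights mention a WER constraint, the dual formulation (minimize supervision cost subject to a target error rate) is equally plausible; the sketch only shows the general shape of the cost–utility tradeoff.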

Highlights

  • Many natural language processing (NLP) tasks require human supervision to be useful in practice, be it to collect suitable training material or to meet some desired output quality

  • We compare three scenarios: a baseline simulation, in which the baseline segments are transcribed in ascending order of confidence (see the sketch after this list); a simulation using the proposed method, in which we vary the word error rate (WER) constraint in small increments; and an oracle simulation, which uses the true costs and utilities in place of the model's estimates

  • We note that the WER estimate produced by our utility model was off by about 2.5% absolute: while the predicted improvement in WER was from 22.33% to 15.0%, the actual improvement was from 19.96% to about 12.5%
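As a rough illustration of the confidence-ordered baseline simulation mentioned in the second highlight, the sketch below supervises segments in ascending order of confidence and records the corpus WER after each correction. The field names, and the assumption that supervision removes all of a segment's errors, are ours rather than the paper's.

```python
# Hypothetical baseline simulation: correct segments least-confident-first
# and track how the corpus word error rate (WER) falls after each correction.

def simulate_baseline(segments, total_ref_words):
    """segments: list of dicts with keys 'confidence' (model confidence) and
    'errors' (word errors the segment currently contributes).
    Returns the WER curve as a list of fractions, one point per correction."""
    remaining_errors = sum(s["errors"] for s in segments)
    wer_curve = [remaining_errors / total_ref_words]  # WER before any supervision
    for seg in sorted(segments, key=lambda s: s["confidence"]):
        remaining_errors -= seg["errors"]  # assume supervision fixes every error
        wer_curve.append(remaining_errors / total_ref_words)
    return wer_curve
```

Sweeping an annotation budget along this curve yields the baseline points against which the proposed method and the oracle are compared.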

Summary

Introduction

Many natural language processing (NLP) tasks require human supervision to be useful in practice, be it to collect suitable training material or to meet some desired output quality. In the standard approach, the sentences with the lowest confidence are selected as the data to be annotated (Figure 1 (a)). It has been noted that when the NLP system in question already has relatively high accuracy, annotating entire sentences can be wasteful, as most words will already be correct (Tomanek and Hahn, 2009; Neubig et al., 2011). In these cases, it is possible to achieve a much higher benefit per annotated word by annotating sub-sentential units (Figure 1 (b)).
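For illustration, the hypothetical sketch below contrasts the two strategies of Figure 1 under a fixed word budget: (a) selecting whole sentences by ascending average confidence versus (b) selecting individual low-confidence words across sentence boundaries. The confidence scores stand in for whatever the underlying NLP system provides.

```python
# Hypothetical contrast of sentence-level vs. sub-sentential selection.
# corpus: list of sentences, each a non-empty list of (word, confidence) pairs.

def pick_sentences(corpus, word_budget):
    """Figure 1 (a): whole sentences, least confident (on average) first."""
    picked, used = [], 0
    for sent in sorted(corpus, key=lambda s: sum(c for _, c in s) / len(s)):
        if used + len(sent) > word_budget:
            break
        picked.append(sent)
        used += len(sent)
    return picked

def pick_words(corpus, word_budget):
    """Figure 1 (b): the globally least confident words, ignoring sentence bounds."""
    scored = [(c, w) for sent in corpus for w, c in sent]
    return [w for c, w in sorted(scored)[:word_budget]]
```

When the system is already accurate, most words inside a low-confidence sentence are still correct, so the word-level strategy concentrates annotation effort where it matters; the paper's contribution is to group such words into segments so that the annotator also retains useful context.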

