Abstract

Natural Language Processing (NLP) methods demand elaborate strategies for creating the corpora that well-performing NLP systems depend on. In this thesis, we present different corpus creation strategies and application scenarios for various NLP tasks and show how they can benefit a task. One focus lies on automatic summarization and summary evaluation, the other on corpus creation for text classification tasks. To this end, the first part of the thesis provides the necessary background on corpus annotation for such an analysis: Chapter 2 details research on corpus annotation theory and on annotation practices in disciplines such as Corpus Linguistics and Computational Linguistics/NLP. It also introduces the crowdsourcing approach to language annotation. Chapter 3 shows how different annotator populations annotate datasets with different annotation strategies that combine human and machine input. Chapter 4 provides the background and a historical overview of the foundations of automatic summarization and summary evaluation. We show that automatic summarization is a challenging NLP task and highlight the limiting focus of research on short English newswire datasets, which can lead to rather skewed results. The second part deals with specific application scenarios in automatic summarization and summary evaluation. Chapter 5 describes the creation of a hierarchical summarization dataset. This dataset addresses two limitations in research: the focus on news datasets is broadened with heterogeneous documents, and the source documents for the summaries are longer. Our research makes use of both crowdworkers and expert annotators and shows how the strengths of both populations can be meaningfully combined in a larger corpus.
Chapter 6 presents how research can benefit from extending an existing heterogeneous summarization corpus from the educational domain with a range of further topics from that domain. Furthermore, we introduce an evaluation of summarization difficulty using heterogeneity estimators based on measures from information theory and on cosine similarity. Chapter 7 outlines the creation of a summary evaluation corpus with annotations of a content-based evaluation metric, the Pyramid method. We apply an existing automatic method to create the Pyramids on the same corpus and show that they correspond well to manual expert Pyramids. In the third part, the focus lies on general corpus creation, illustrated by two further tasks that are both machine learning (ML)-oriented. Chapter 8 describes a crowdsourcing method that annotates items based on input data complexity, measured with metrics from language learning, NLP, and information theory. We create different subsets of the data that also serve to train and filter crowdworkers. We test the method on an existing three-class sentence classification dataset from argument mining and show that it needs fewer annotators to achieve the same inter-annotator agreement as randomly distributed dataset portions. Chapter 9 presents the creation of a dataset that captures discourse conventions in social science texts on the topic of Artificial Intelligence (AI). The dataset consists of subsets of data from different domains: software development, research paper abstracts, and online discussions. We annotate the dataset with expert active learning, where the ML model "asks" for annotations on certain items. Moreover, we evaluate the conventions that an ML model predicts and explain why the model can detect these conventions correctly.
