Abstract

Natural Language Processing (NLP) methods demand elaborate strategies for creating the corpora that well-performing NLP systems depend on. In this thesis, we present different corpus creation strategies and application scenarios for various NLP tasks and show how they can benefit a task. One focus lies on automatic summarization and summary evaluation, the other on corpus creation for text classification tasks. To this end, the first part of the thesis provides the necessary background on corpus annotation for such an analysis: Chapter 2 details research on corpus annotation theory and on annotation practices in disciplines such as Corpus Linguistics and Computational Linguistics/NLP. It also introduces the crowdsourcing approach to language annotation. Chapter 3 shows how different annotator populations annotate datasets with different annotation strategies that combine human and machine input. Chapter 4 provides the background and a historical overview of the foundations of automatic summarization and summary evaluation. We show that automatic summarization is a challenging NLP task and highlight the limiting focus of research on short English newswire datasets, which can lead to rather skewed results. The second part deals with specific application scenarios in automatic summarization and summary evaluation. Chapter 5 describes the creation of a hierarchical summarization dataset. This dataset addresses two limitations in research: the focus on news datasets is broadened with heterogeneous documents, and the source documents for the summaries are longer. Our research makes use of both crowdworkers and expert annotators and shows how the strengths of both populations can be meaningfully combined in a larger corpus.
Chapter 6 presents how research can benefit from extending an existing heterogeneous summarization corpus from the educational domain with a range of further topics from that domain. Furthermore, we introduce an evaluation of summarization difficulty using heterogeneity estimators based on measures from information theory and on cosine similarity. Chapter 7 outlines the creation of a summary evaluation corpus with annotations of a content-based evaluation metric, the Pyramid method. We apply an existing automatic method to create the Pyramids on the same corpus and show that they correspond well to manual expert Pyramids. In the third part, the focus lies on general corpus creation, illustrated by two further tasks that are both machine learning (ML)-oriented. Chapter 8 describes a crowdsourcing method that annotates items based on input data complexity, measured with metrics from language learning, NLP, and information theory. We create different subsets of the data that also serve to train and filter crowdworkers. We test the method on an existing three-class sentence classification dataset from argument mining and show that it needs fewer annotators to achieve the same inter-annotator agreement as randomly distributed dataset portions. Chapter 9 presents the creation of a dataset that captures discourse conventions in social science texts on the topic of Artificial Intelligence (AI). The dataset consists of subsets of data from different domains: software development, research paper abstracts, and online discussions. We annotate the dataset with expert active learning, where the ML model "asks" for annotations on certain items. Moreover, we evaluate the conventions that an ML model predicts and explain why the model can detect these conventions correctly.
