Cost-efficient quality assurance of natural language processing tools through continuous monitoring with continuous integration

Bodo Kraft, Albert Zündorf, Marc Schreiber

doi:10.1145/2897022.2897029

Abstract

More and more modern applications, e.g. Information Extraction (IE) or Question Answering (QA) systems, make use of natural language data. These applications require preprocessing through Natural Language Processing (NLP) pipelines, and their output quality depends on the output quality of those pipelines. If NLP pipelines are applied in a different domain, the output quality decreases and the application requires domain-specific NLP training to improve it. Adapting NLP tools to specific domains is a time-consuming and expensive task, which raises two key questions: a) how many documents need to be annotated to reach good output quality, and b) which NLP tools build the best-performing NLP pipeline? In this paper we demonstrate a monitoring system based on principles of Continuous Integration that addresses these questions and guides IE or QA application developers in building high-quality NLP pipelines in a cost-efficient way. The monitoring system is built from common tools that are already used in many software engineering projects.

Full Text