Abstract

In this thesis, I investigate and develop methods for automatically analyzing and assessing German syntactic structures in domain-specific texts, using German-language Swiss law texts as the domain. The automatic annotation of syntactic structures has long been studied in natural language processing research. Supervised statistical methods are regarded as the state of the art in parsing; they are accurate but biased by the type of text they are trained on. Consequently, the accuracy of statistical parsers decreases when they are applied to domain-specific texts. This domain bias in syntactic annotation needs to be addressed whenever it directly affects the accuracy of an application. The syntactic assessment that I develop in this thesis is such an application: it requires highly accurate syntactic annotation. An effective solution would be to manually annotate a large portion of the required domain texts. However, this is not feasible in practice because manual linguistic annotation is extremely labor intensive. To overcome this problem, I develop syntactic annotation methods that do not require the manual annotation of a large portion of the domain texts. The goal of this thesis is to achieve annotation accuracy on domain-specific texts that is high enough for use in the application. For the automatic syntactic assessment, I demonstrate a novel approach that models domain-specific style choice by combining rule-based and statistical methods. In the rule-based approach, I present a method that automatically detects violations of the style rules in legislative style guidelines. In the statistical approach, domain-specific writing style is defined in terms of the stylistic choice between syntactic alternations. This syntactic selection is modeled statistically by classifying syntactic alternatives according to their syntactic complexity. The syntactic assessment requires automatic syntactic annotation.
For the automatic syntactic annotation, I present a linguistically motivated hybrid supertagger that analyzes topological dependency grammar relations in German. In this thesis, supertagging problems are treated as problems of morphosyntactic ambiguity and syntactic resolution. Depending on the linguistic phenomenon, the ambiguity is resolved by applying either a rule-based or a statistical tagging method: morphological and syntactic hard constraints are applied in a constraint grammar approach, whereas lexical, semantic, and pragmatic soft and multivariate constraints are integrated into a conditional random fields model. The main contribution of this thesis to natural language processing is to show that a linguistically motivated annotation method is a viable approach to achieving high performance in syntactic analysis with only a few hundred manually annotated sentences from the domain.
