Spiral construction of syntactically annotated spoken language corpus

T Ohno, S Matsuhara, Y Inagaki, N Kawaguchi

doi:10.1109/nlpke.2003.1275953

Abstract

Spontaneous speech exhibits a broad range of linguistic phenomena characteristic of spoken language, so a statistical approach should be effective for parsing it robustly. Stochastic parsing requires a large-scale syntactically annotated corpus, but constructing one demands considerable human effort. We propose a method for efficiently constructing a spoken language corpus annotated with dependency analyses. The method starts from an existing spoken language corpus: a stochastic dependency parser tags the spoken language sentences with dependency structures, and the results are corrected manually. The tagged corpus is built in a spiral fashion, in which the corrected data serve as the statistical information for automatically parsing further data. This spiral approach reduces parsing errors and, in turn, the cost of correction. An experiment using 10,995 Japanese utterances shows the spiral approach to be effective for efficient corpus construction.
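The spiral loop described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `train`, `parse`, and `spiral` are hypothetical, the "stochastic parser" is reduced to counts of (modifier, head) word pairs, each unit is assumed to depend on a later unit (with the last unit as root), and manual correction is simulated by comparing against gold annotations.

```python
from collections import Counter


def train(corrected):
    """Count (modifier, head) word pairs in the manually corrected data."""
    counts = Counter()
    for words, heads in corrected:
        for i, h in enumerate(heads):
            if h is not None:
                counts[(words[i], words[h])] += 1
    return counts


def parse(words, counts):
    """Toy parser: attach each unit to the later unit with the highest
    pair count. Ties (including unseen pairs) go to the nearest later unit.
    The last unit is treated as the root and gets no head."""
    n = len(words)
    heads = []
    for i in range(n):
        if i == n - 1:
            heads.append(None)
            continue
        best = max(
            range(i + 1, n),
            key=lambda j: (counts[(words[i], words[j])], -j),
        )
        heads.append(best)
    return heads


def spiral(batches, gold_batches):
    """Parse each batch with statistics from all previously corrected
    batches, then "correct" it (simulated here with gold heads) and add it
    to the training data. Returns the number of corrections per round."""
    corrected = []
    corrections_per_round = []
    for batch, gold in zip(batches, gold_batches):
        counts = train(corrected)  # retrain on everything corrected so far
        fixes = 0
        for words, gold_heads in zip(batch, gold):
            predicted = parse(words, counts)
            fixes += sum(p != g for p, g in zip(predicted, gold_heads))
            corrected.append((words, gold_heads))
        corrections_per_round.append(fixes)
    return corrections_per_round


# Toy data (hypothetical): the same three-unit sentence in two rounds.
batches = [[["a", "b", "c"]], [["a", "b", "c"]]]
gold = [[[2, 2, None]], [[2, 2, None]]]
print(spiral(batches, gold))  # prints [1, 0]
```

In this toy run the first round needs one correction because no statistics exist yet, while the second round needs none because the corrected first round has been fed back into the parser. This mirrors the reduction in correction cost that the abstract attributes to the spiral approach, though it says nothing about the size of the effect on real data.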

Full Text