SEMI-AUTOMATIC EXTRACTION OF LINGUISTIC INFORMATION FOR SYNTACTIC DISAMBIGUATION

Roberto Basili, Paola Velardi, Maria Teresa Pazienza

doi:10.1080/08839519308949994

Abstract

The robustness of NLP techniques can be improved by the use of “shallow” methods, such as statistical analysis, in combination with traditional knowledge-based methods, such as syntax and semantics. This paper describes a hybrid methodology for extracting from corpora preference criteria for syntactically ambiguous structures. The method is based on the statistical analysis of word co-occurrences augmented with syntactic and semantic tags, which we call clustered association data. The proposed method is shown to exhibit a better trade-off between the precision of the acquired data and the amount of manual work required than similar algorithms proposed in the literature. Furthermore, the use of semantic tags makes it possible to obtain a statistically relevant amount of reliable data even when the application corpus does not exceed 500,000 words.
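The idea of clustered association data can be illustrated with a minimal sketch. This is not the authors' algorithm; it only shows how co-occurrences generalized through semantic tags could yield a preference between two competing attachments. The toy lexicon, the tag names (`ACT`, `FOOD`, `INSTRUMENT`), the function names, and the observations are all hypothetical and chosen purely for illustration.

```python
from collections import Counter

# Hypothetical semantic tags for a toy lexicon (an assumption, not from the paper).
SEMANTIC_TAG = {
    "eat": "ACT", "see": "ACT",
    "pizza": "FOOD", "pasta": "FOOD", "cheese": "FOOD",
    "fork": "INSTRUMENT", "telescope": "INSTRUMENT",
}


def cluster(word):
    """Map a word to its semantic tag, or to itself if it has none."""
    return SEMANTIC_TAG.get(word, word)


def collect(observations):
    """Count (head tag, syntactic relation, modifier tag) triples."""
    counts = Counter()
    for head, relation, modifier in observations:
        counts[(cluster(head), relation, cluster(modifier))] += 1
    return counts


def prefer(counts, candidates):
    """Return the candidate attachment whose clustered triple was seen most often."""
    return max(
        candidates,
        key=lambda t: counts[(cluster(t[0]), t[1], cluster(t[2]))],
    )


# Toy co-occurrences, assumed to come from unambiguous contexts in a corpus.
observations = [
    ("eat", "with", "fork"),
    ("see", "with", "telescope"),
    ("pizza", "with", "cheese"),
]
counts = collect(observations)

# "eat pasta with fork": does "with fork" attach to the verb or to the noun?
candidates = [("eat", "with", "fork"), ("pasta", "with", "fork")]
best = prefer(counts, candidates)
print(best)
```

Because counts are kept per semantic tag rather than per word, the two `ACT`/`INSTRUMENT` observations reinforce each other. This is the sense in which tagging can make sparse counts from a small corpus more usable, under the assumptions stated above.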
