Abstract

Space-delimited words in Turkish and Hebrew text can be further segmented into meaningful units, but syntactic and semantic context is necessary to predict segmentation. At the same time, predicting correct syntactic structures relies on correct segmentation. We present a graph-based lattice dependency parser that operates on morphological lattices to represent different segmentations and morphological analyses for a given input sentence. The lattice parser predicts a dependency tree over a path in the lattice and thus solves the joint task of segmentation, morphological analysis, and syntactic parsing. We conduct experiments on the Turkish and the Hebrew treebank and show that the joint model outperforms three state-of-the-art pipeline systems on both data sets. Our work corroborates findings from constituency lattice parsing for Hebrew and presents the first results for full lattice parsing on Turkish.

Highlights

  • Linguistic theory has provided examples from many different languages in which grammatical information is expressed via case marking, morphological agreement, or clitics

  • For Hebrew, the baseline is the disambiguated lattices provided by the SPMRL 2014 Shared Task

  • The IGeval metric is designed to evaluate syntactic quality with less attention to morphological analysis and segmentation. Both PIPELINE and JOINT achieve very similar results, and none of the differences is statistically significant. These results suggest that a good part of the improvement in the lattice parser occurs in morphological analysis and segmentation, whereas the quality of the syntactic annotation stays essentially the same between the pipeline and the joint model


Summary

Introduction

Linguistic theory has provided examples from many different languages in which grammatical information is expressed via case marking, morphological agreement, or clitics. In these languages, configurational information is less important than in English since the words are overtly marked for their syntactic relations to each other. Such morphologically rich languages pose many new challenges to today’s natural language processing technology, which has often been developed for English. One of the first challenges is the question of how to represent morphologically rich languages and what the basic units of analysis are (Tsarfaty et al., 2010). A space-delimited word in the treebank can consist of several morphemes that may belong to independent syntactic contexts.
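To make the idea of a morphological lattice concrete, the following is a minimal sketch, not the paper's implementation. It represents a lattice as a list of arcs between integer states and enumerates every path from the start state to the final state, where each path corresponds to one candidate segmentation with its morphological tags. The `Arc` type, the `enumerate_paths` function, and the toy forms and tags are all hypothetical and chosen only for illustration. An actual lattice parser would search for the best tree and path jointly rather than listing all paths.

```python
from collections import defaultdict
from typing import Iterator, List, NamedTuple


class Arc(NamedTuple):
    """One candidate morphological unit spanning two lattice states (hypothetical type)."""
    start: int
    end: int
    form: str
    tag: str


def enumerate_paths(arcs: List[Arc], start: int, final: int) -> Iterator[List[Arc]]:
    """Yield every sequence of arcs leading from `start` to `final`."""
    outgoing = defaultdict(list)
    for arc in arcs:
        outgoing[arc.start].append(arc)

    def walk(state: int, prefix: List[Arc]) -> Iterator[List[Arc]]:
        if state == final:
            yield list(prefix)  # copy, because prefix is mutated while backtracking
            return
        for arc in outgoing[state]:
            prefix.append(arc)
            yield from walk(arc.end, prefix)
            prefix.pop()

    yield from walk(start, [])


# Invented toy lattice: the first space-delimited word is ambiguous between
# one unit ("abc") and two units ("a" + "bc"); the second word is unambiguous.
TOY_LATTICE = [
    Arc(0, 2, "abc", "NOUN"),
    Arc(0, 1, "a", "PREP"),
    Arc(1, 2, "bc", "NOUN"),
    Arc(2, 3, "d", "VERB"),
]

paths = list(enumerate_paths(TOY_LATTICE, start=0, final=3))
segmentations = [[arc.form for arc in path] for path in paths]

for path in paths:
    print(" ".join(f"{arc.form}/{arc.tag}" for arc in path))
```

In this toy lattice the two paths differ in the number of units, so a dependency tree over one path has a different set of nodes than a tree over the other. This is why segmentation and parsing cannot be cleanly separated in such a setting.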

Methods
Results
Conclusion
